House Prices - Kaggle Copetitions

image Introduction:

Develop your skills in data science through a useful and common case to everyone, the forecast of the sale prices of houses. With this premise there is a plethora of available data, where we can highlight the data set of Boston Housing from Harrison and Rubinfeld (1978) as the most famous for beginners and already much explored.

But after so many years would this data set present enough complexity and challenge for students to practice a whole range of new knowledge acquired in their courses in statistics and data science? Would it have a sufficient number of observations and features to make it necessary to analyze outliers, collinearity, multicollinearity, the need for selection and reduction of dimensionality of features, it to not mention its applicability to more modern machine learning techniques and algorithms?

Not in the opinion of Dr Dean De Cock, Professor of Statistics at Truman State University. He was looking for a data set that would allow students the opportunity to display the skills they had learned within the class.

In his quest, he finally found with the Ames City Assessor's Office a potential data set. The initial Excel file contained 113 variables describing 3970 property sales that had occurred in Ames, Iowa between 2006 and 2010. The variables were a mix of nominal, ordinal, continuous, and discrete variables used in calculation of assessed values and included physical property measurements in addition to computation variables used in the city's assessment process. For Dr Dean purposes, a "layman's" data set that could be easily understood by users at all levels was desirable; so He began his project by removing any variables that required special knowledge or previous calculations for their use. Most of these deleted variables were related to weighting and adjustment factors used in the city's current modeling system.

In the end, the selected dataset has 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. The case was so interesting that it was introduced as one of the competitions for beginners of Kaggle, quickly becoming one of the most popular. If you interest to get some insights and suggestions directly from Dr Dean before start, I recommend that you read his paper.

With this spirit, I create this material, not with the purpose of competing, but of intended to cover most of the techniques of data analysis in Python for regression analysis. That is why it follows the natural flow of ML and contains many texts and links to the techniques, made your conference and references easy. as it can be extended over time.

In order not to fall into monotony, sometimes I take some liberties and apply a little humor, but nothing that compromises the accuracy of knowledge that is being acquired by the reader. Of course, anyone who wants to contribute some addition or even a claim or choreography will be very welcome. But what should you be thinking now? So, enough of tangle and let's go. image Competition Description:

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition from Kaggle challenges you to predict the final price of each home.

Practice Skills

  • Creative feature engineering
  • Advanced regression techniques like random forest and gradient boosting

For a detailed description of the data set, click here.

Table of Contents

Preparing environment and uploading data

Import Packages

In [1]:
import os
import warnings
warnings.simplefilter(action = 'ignore', category=FutureWarning)
warnings.filterwarnings('ignore')
def ignore_warn(*args, **kwargs):
    pass

warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="ticks", color_codes=True, font_scale=1.5)
color = sns.color_palette()
sns.set_style('darkgrid')
import mpl_toolkits
from mpl_toolkits.mplot3d import Axes3D
import pylab 

from scipy import stats
from scipy.stats import skew, norm, probplot, boxcox
from scipy.special import boxcox1p
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import RobustScaler, PolynomialFeatures, StandardScaler, LabelEncoder
from sklearn.feature_selection import f_regression, mutual_info_regression, SelectKBest, RFECV, SelectFromModel
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from sklearn.feature_extraction import FeatureHasher
from sklearn.decomposition import PCA, KernelPCA
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression, OrthogonalMatchingPursuit, Lasso, LassoLarsIC, ElasticNet, ElasticNetCV
from sklearn.linear_model import ARDRegression, SGDRegressor, PassiveAggressiveRegressor, HuberRegressor, BayesianRidge
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, BaggingRegressor, ExtraTreesRegressor
import xgboost as xgb
from xgboost import XGBRegressor, plot_importance
import lightgbm as lgb

from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, cross_val_predict, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Load Datasets

In [2]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#Now drop the  'Id' colum since it's unnecessary for  the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

train.rename(columns={'3SsnPorch':'TSsnPorch'}, inplace=True)
test.rename(columns={'3SsnPorch':'TSsnPorch'}, inplace=True)

test['SalePrice'] = 0

Exploratory Data Analysis (EDA)

image

Take a First Look of our Data:

I created the function below to simplify the analysis of general characteristics of the data. Inspired on the str function of R, this function returns the types, counts, distinct, count nulls, missing ratio and uniques values of each field/feature.

If the study involve some supervised learning, this function can return the study of the correlation, for this we just need provide the dependent variable to the pred parameter.

Also, if its return is stored in a variable you can evaluate it in more detail, focus on specific field, or sort them from different perspectives.

In [3]:
def rstr(df, pred=None): 
    obs = df.shape[0]
    types = df.dtypes
    counts = df.apply(lambda x: x.count())
    uniques = df.apply(lambda x: [x.unique()])
    nulls = df.apply(lambda x: x.isnull().sum())
    distincts = df.apply(lambda x: x.unique().shape[0])
    missing_ration = (df.isnull().sum()/ obs) * 100
    skewness = df.skew()
    kurtosis = df.kurt() 
    print('Data shape:', df.shape)
    
    if pred is None:
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing ration', 'uniques', 'skewness', 'kurtosis']
        str = pd.concat([types, counts, distincts, nulls, missing_ration, uniques, skewness, kurtosis], axis = 1)

    else:
        corr = df.corr()[pred]
        str = pd.concat([types, counts, distincts, nulls, missing_ration, uniques, skewness, kurtosis, corr], axis = 1, sort=False)
        corr_col = 'corr '  + pred
        cols = ['types', 'counts', 'distincts', 'nulls', 'missing_ration', 'uniques', 'skewness', 'kurtosis', corr_col ]
    
    str.columns = cols
    dtypes = str.types.value_counts()
    print('___________________________\nData types:\n',str.types.value_counts())
    print('___________________________')
    return str
In [4]:
details = rstr(train, 'SalePrice')
display(details.sort_values(by='corr SalePrice', ascending=False))
Data shape: (1460, 80)
___________________________
Data types:
 object     43
int64      34
float64     3
Name: types, dtype: int64
___________________________
types counts distincts nulls missing_ration uniques skewness kurtosis corr SalePrice
SalePrice int64 1460 663 0 0.000 [[208500, 181500, 223500, 140000, 250000, 1430... 1.883 6.536 1.000
OverallQual int64 1460 10 0 0.000 [[7, 6, 8, 5, 9, 4, 10, 3, 1, 2]] 0.217 0.096 0.791
GrLivArea int64 1460 861 0 0.000 [[1710, 1262, 1786, 1717, 2198, 1362, 1694, 20... 1.367 4.895 0.709
GarageCars int64 1460 5 0 0.000 [[2, 3, 1, 0, 4]] -0.343 0.221 0.640
GarageArea int64 1460 441 0 0.000 [[548, 460, 608, 642, 836, 480, 636, 484, 468,... 0.180 0.917 0.623
TotalBsmtSF int64 1460 721 0 0.000 [[856, 1262, 920, 756, 1145, 796, 1686, 1107, ... 1.524 13.250 0.614
1stFlrSF int64 1460 753 0 0.000 [[856, 1262, 920, 961, 1145, 796, 1694, 1107, ... 1.377 5.746 0.606
FullBath int64 1460 4 0 0.000 [[2, 1, 3, 0]] 0.037 -0.857 0.561
TotRmsAbvGrd int64 1460 12 0 0.000 [[8, 6, 7, 9, 5, 11, 4, 10, 12, 3, 2, 14]] 0.676 0.881 0.534
YearBuilt int64 1460 112 0 0.000 [[2003, 1976, 2001, 1915, 2000, 1993, 2004, 19... -0.613 -0.440 0.523
YearRemodAdd int64 1460 61 0 0.000 [[2003, 1976, 2002, 1970, 2000, 1995, 2005, 19... -0.504 -1.272 0.507
GarageYrBlt float64 1379 98 81 5.548 [[2003.0, 1976.0, 2001.0, 1998.0, 2000.0, 1993... -0.649 -0.418 0.486
MasVnrArea float64 1452 328 8 0.548 [[196.0, 0.0, 162.0, 350.0, 186.0, 240.0, 286.... 2.669 10.082 0.477
Fireplaces int64 1460 4 0 0.000 [[0, 1, 2, 3]] 0.650 -0.217 0.467
BsmtFinSF1 int64 1460 637 0 0.000 [[706, 978, 486, 216, 655, 732, 1369, 859, 0, ... 1.686 11.118 0.386
LotFrontage float64 1201 111 259 17.740 [[65.0, 80.0, 68.0, 60.0, 84.0, 85.0, 75.0, na... 2.164 17.453 0.352
WoodDeckSF int64 1460 274 0 0.000 [[0, 298, 192, 40, 255, 235, 90, 147, 140, 160... 1.541 2.993 0.324
2ndFlrSF int64 1460 417 0 0.000 [[854, 0, 866, 756, 1053, 566, 983, 752, 1142,... 0.813 -0.553 0.319
OpenPorchSF int64 1460 202 0 0.000 [[61, 0, 42, 35, 84, 30, 57, 204, 4, 21, 33, 2... 2.364 8.490 0.316
HalfBath int64 1460 3 0 0.000 [[1, 0, 2]] 0.676 -1.077 0.284
LotArea int64 1460 1073 0 0.000 [[8450, 9600, 11250, 9550, 14260, 14115, 10084... 12.208 203.243 0.264
BsmtFullBath int64 1460 4 0 0.000 [[1, 0, 2, 3]] 0.596 -0.839 0.227
BsmtUnfSF int64 1460 780 0 0.000 [[150, 284, 434, 540, 490, 64, 317, 216, 952, ... 0.920 0.475 0.214
BedroomAbvGr int64 1460 8 0 0.000 [[3, 4, 1, 2, 0, 5, 6, 8]] 0.212 2.231 0.168
ScreenPorch int64 1460 76 0 0.000 [[0, 176, 198, 291, 252, 99, 184, 168, 130, 14... 4.122 18.439 0.111
PoolArea int64 1460 8 0 0.000 [[0, 512, 648, 576, 555, 480, 519, 738]] 14.828 223.268 0.092
MoSold int64 1460 12 0 0.000 [[2, 5, 9, 12, 10, 8, 11, 4, 1, 7, 3, 6]] 0.212 -0.404 0.046
TSsnPorch int64 1460 20 0 0.000 [[0, 320, 407, 130, 180, 168, 140, 508, 238, 2... 10.304 123.662 0.045
BsmtFinSF2 int64 1460 144 0 0.000 [[0, 32, 668, 486, 93, 491, 506, 712, 362, 41,... 4.255 20.113 -0.011
BsmtHalfBath int64 1460 3 0 0.000 [[0, 1, 2]] 4.103 16.397 -0.017
... ... ... ... ... ... ... ... ... ...
RoofStyle object 1460 6 0 0.000 [[Gable, Hip, Gambrel, Mansard, Flat, Shed]] nan nan nan
RoofMatl object 1460 8 0 0.000 [[CompShg, WdShngl, Metal, WdShake, Membran, T... nan nan nan
Exterior1st object 1460 15 0 0.000 [[VinylSd, MetalSd, Wd Sdng, HdBoard, BrkFace,... nan nan nan
Exterior2nd object 1460 16 0 0.000 [[VinylSd, MetalSd, Wd Shng, HdBoard, Plywood,... nan nan nan
MasVnrType object 1452 5 8 0.548 [[BrkFace, None, Stone, BrkCmn, nan]] nan nan nan
ExterQual object 1460 4 0 0.000 [[Gd, TA, Ex, Fa]] nan nan nan
ExterCond object 1460 5 0 0.000 [[TA, Gd, Fa, Po, Ex]] nan nan nan
Foundation object 1460 6 0 0.000 [[PConc, CBlock, BrkTil, Wood, Slab, Stone]] nan nan nan
BsmtQual object 1423 5 37 2.534 [[Gd, TA, Ex, nan, Fa]] nan nan nan
BsmtCond object 1423 5 37 2.534 [[TA, Gd, nan, Fa, Po]] nan nan nan
BsmtExposure object 1422 5 38 2.603 [[No, Gd, Mn, Av, nan]] nan nan nan
BsmtFinType1 object 1423 7 37 2.534 [[GLQ, ALQ, Unf, Rec, BLQ, nan, LwQ]] nan nan nan
BsmtFinType2 object 1422 7 38 2.603 [[Unf, BLQ, nan, ALQ, Rec, LwQ, GLQ]] nan nan nan
Heating object 1460 6 0 0.000 [[GasA, GasW, Grav, Wall, OthW, Floor]] nan nan nan
HeatingQC object 1460 5 0 0.000 [[Ex, Gd, TA, Fa, Po]] nan nan nan
CentralAir object 1460 2 0 0.000 [[Y, N]] nan nan nan
Electrical object 1459 6 1 0.068 [[SBrkr, FuseF, FuseA, FuseP, Mix, nan]] nan nan nan
KitchenQual object 1460 4 0 0.000 [[Gd, TA, Ex, Fa]] nan nan nan
Functional object 1460 7 0 0.000 [[Typ, Min1, Maj1, Min2, Mod, Maj2, Sev]] nan nan nan
FireplaceQu object 770 6 690 47.260 [[nan, TA, Gd, Fa, Ex, Po]] nan nan nan
GarageType object 1379 7 81 5.548 [[Attchd, Detchd, BuiltIn, CarPort, nan, Basme... nan nan nan
GarageFinish object 1379 4 81 5.548 [[RFn, Unf, Fin, nan]] nan nan nan
GarageQual object 1379 6 81 5.548 [[TA, Fa, Gd, nan, Ex, Po]] nan nan nan
GarageCond object 1379 6 81 5.548 [[TA, Fa, nan, Gd, Po, Ex]] nan nan nan
PavedDrive object 1460 3 0 0.000 [[Y, N, P]] nan nan nan
PoolQC object 7 4 1453 99.521 [[nan, Ex, Fa, Gd]] nan nan nan
Fence object 281 5 1179 80.753 [[nan, MnPrv, GdWo, GdPrv, MnWw]] nan nan nan
MiscFeature object 54 5 1406 96.301 [[nan, Shed, Gar2, Othr, TenC]] nan nan nan
SaleType object 1460 9 0 0.000 [[WD, New, COD, ConLD, ConLI, CWD, ConLw, Con,... nan nan nan
SaleCondition object 1460 6 0 0.000 [[Normal, Abnorml, Partial, AdjLand, Alloca, F... nan nan nan

80 rows × 9 columns

Some Observations from the STR Details:

image

  • The dependent variabel, SalePrice, are skewed and heavy-tailed distribution. We need investigate its distribution with a plot and check if a transformation by Log 1P could correct it, withou drop most of the outiliers.

  • Nulls: The data have 19 features with nulls, five of then area categorical and with more then 47% of missing ration. They are candidates to drop or use them to create another more interesting feature:

    • PoolQC
    • MiscFeature
    • Alley
    • Fence
    • FireplaceQu

  • Features high skewed right, heavy-tailed distribution, and with high correlation to Sales Price. It is important to treat them (boxcox 1p transformation, Robustscaler, and drop some outliers):
    • TotalBsmtSF
    • 1stFlrSF
    • GrLivArea

  • Features skewed, heavy-tailed distribution, and with good correlation to Sales Price. It is important to treat them (boxcox 1p transformation, Robustscaler, and drop some outliers):
    • LotArea
    • KitchenAbvGr
    • ScreenPorch
    • EnclosedPorch
    • MasVnrArea
    • OpenPorchSF
    • LotFrontage
    • BsmtFinSF1
    • WoodDeckSF
    • MSSubClass

  • Features high skewed, heavy-tailed distribution, and with low correlation to Sales Price. Maybe we can drop these features, or just use they with other to create a new more importants feature:
    • MiscVal
    • TSsnPorch
    • LowQualFinSF
    • BsmtFinSF2
    • BsmtHalfBa

  • Features low skewed, and with good to low correlation to Sales Price. Just use a Robustscaler probably reduce the few distorcions:
    • BsmtUnfSF
    • 2ndFlrSF
    • TotRmsAbvGrd
    • HalfBath
    • Fireplaces
    • BsmtFullBath
    • OverallQual
    • BedroomAbvGr
    • GarageArea
    • FullBath
    • GarageCars
    • OverallCond

  • Transforme from Yaer Feature to Age, 2011 - Year feature, or YEAR(TODAY()) - Year Feature
    • YearRemodAdd:
    • YearBuilt
    • GarageYrBlt
    • YrSold

If we apply this data to a Keras, first we need to chnage the float64 and Int64 to float32 and Int32!

First see of some stats of Numeric Data

So, for the main statistics of our numeric data describe the function (like the summary of R)

In [5]:
display(train.describe().transpose())
count mean std min 25% 50% 75% max
MSSubClass 1460.000 56.897 42.301 20.000 20.000 50.000 70.000 190.000
LotFrontage 1201.000 70.050 24.285 21.000 59.000 69.000 80.000 313.000
LotArea 1460.000 10516.828 9981.265 1300.000 7553.500 9478.500 11601.500 215245.000
OverallQual 1460.000 6.099 1.383 1.000 5.000 6.000 7.000 10.000
OverallCond 1460.000 5.575 1.113 1.000 5.000 5.000 6.000 9.000
YearBuilt 1460.000 1971.268 30.203 1872.000 1954.000 1973.000 2000.000 2010.000
YearRemodAdd 1460.000 1984.866 20.645 1950.000 1967.000 1994.000 2004.000 2010.000
MasVnrArea 1452.000 103.685 181.066 0.000 0.000 0.000 166.000 1600.000
BsmtFinSF1 1460.000 443.640 456.098 0.000 0.000 383.500 712.250 5644.000
BsmtFinSF2 1460.000 46.549 161.319 0.000 0.000 0.000 0.000 1474.000
BsmtUnfSF 1460.000 567.240 441.867 0.000 223.000 477.500 808.000 2336.000
TotalBsmtSF 1460.000 1057.429 438.705 0.000 795.750 991.500 1298.250 6110.000
1stFlrSF 1460.000 1162.627 386.588 334.000 882.000 1087.000 1391.250 4692.000
2ndFlrSF 1460.000 346.992 436.528 0.000 0.000 0.000 728.000 2065.000
LowQualFinSF 1460.000 5.845 48.623 0.000 0.000 0.000 0.000 572.000
GrLivArea 1460.000 1515.464 525.480 334.000 1129.500 1464.000 1776.750 5642.000
BsmtFullBath 1460.000 0.425 0.519 0.000 0.000 0.000 1.000 3.000
BsmtHalfBath 1460.000 0.058 0.239 0.000 0.000 0.000 0.000 2.000
FullBath 1460.000 1.565 0.551 0.000 1.000 2.000 2.000 3.000
HalfBath 1460.000 0.383 0.503 0.000 0.000 0.000 1.000 2.000
BedroomAbvGr 1460.000 2.866 0.816 0.000 2.000 3.000 3.000 8.000
KitchenAbvGr 1460.000 1.047 0.220 0.000 1.000 1.000 1.000 3.000
TotRmsAbvGrd 1460.000 6.518 1.625 2.000 5.000 6.000 7.000 14.000
Fireplaces 1460.000 0.613 0.645 0.000 0.000 1.000 1.000 3.000
GarageYrBlt 1379.000 1978.506 24.690 1900.000 1961.000 1980.000 2002.000 2010.000
GarageCars 1460.000 1.767 0.747 0.000 1.000 2.000 2.000 4.000
GarageArea 1460.000 472.980 213.805 0.000 334.500 480.000 576.000 1418.000
WoodDeckSF 1460.000 94.245 125.339 0.000 0.000 0.000 168.000 857.000
OpenPorchSF 1460.000 46.660 66.256 0.000 0.000 25.000 68.000 547.000
EnclosedPorch 1460.000 21.954 61.119 0.000 0.000 0.000 0.000 552.000
TSsnPorch 1460.000 3.410 29.317 0.000 0.000 0.000 0.000 508.000
ScreenPorch 1460.000 15.061 55.757 0.000 0.000 0.000 0.000 480.000
PoolArea 1460.000 2.759 40.177 0.000 0.000 0.000 0.000 738.000
MiscVal 1460.000 43.489 496.123 0.000 0.000 0.000 0.000 15500.000
MoSold 1460.000 6.322 2.704 1.000 5.000 6.000 8.000 12.000
YrSold 1460.000 2007.816 1.328 2006.000 2007.000 2008.000 2009.000 2010.000
SalePrice 1460.000 180921.196 79442.503 34900.000 129975.000 163000.000 214000.000 755000.000

Overall Quality

It is not surprise that overall quality has the highest correlation with SalePrice among the numeric variables (0.79). It rates the overall material and finish of the house on a scale from 1 (very poor) to 10 (very excellent). The positive correlation is certainly there indeed, and seems to be a slightly upward curve. Regarding outliers, I do not see any extreme values. If there is a candidate to take out as an outlier later on, it seems to be the expensive house with grade 4.

Especially the two houses with really big living areas and low SalePrices seem outliers. I will not take them out yet, as taking outliers can be dangerous. For instance, a low score on the Overall Quality could explain a low price. However, as you can see below, these two houses actually also score maximum points on Overall Quality. Therefore, I will keep theses houses in mind as prime candidates to take out as outliers. quality

In [6]:
fig = plt.figure(figsize=(20, 15))
sns.set(font_scale=1.5)

# (Corr= 0.817185) Box plot overallqual/salePrice
fig1 = fig.add_subplot(221); sns.boxplot(x='OverallQual', y='SalePrice', data=train[['SalePrice', 'OverallQual']])

# (Corr= 0.700927) GrLivArea vs SalePrice plot
fig2 = fig.add_subplot(222); 
sns.scatterplot(x = train.GrLivArea, y = train.SalePrice, hue=train.OverallQual, palette= 'Spectral')

# (Corr= 0.680625) GarageCars vs SalePrice plot
fig3 = fig.add_subplot(223); 
sns.scatterplot(x = train.GarageCars, y = train.SalePrice, hue=train.OverallQual, palette= 'Spectral')

# (Corr= 0.650888) GarageArea vs SalePrice plot
fig4 = fig.add_subplot(224); 
sns.scatterplot(x = train.GarageArea, y = train.SalePrice, hue=train.OverallQual, palette= 'Spectral')

fig5 = plt.figure(figsize=(16, 8))
fig6 = fig5.add_subplot(121); 
sns.scatterplot(y = train.SalePrice , x = train.TotalBsmtSF, hue=train.OverallQual, palette= 'YlOrRd')

fig7 = fig5.add_subplot(122); 
sns.scatterplot(y = train.SalePrice, x = train['1stFlrSF'], hue=train.OverallQual, palette= 'YlOrRd')

plt.tight_layout(); plt.show()
In [7]:
fig = plt.figure(figsize=(20,5))
ax = fig.add_subplot(121)
sns.scatterplot(x = train.GrLivArea, y = train.SalePrice, ax = ax)

#Deleting outliers
train = train.drop(train[(train.GrLivArea>4000) & (train.SalePrice<300000)].index)

#Check the graphic again
ax = fig.add_subplot(122)
sns.scatterplot(x =train.GrLivArea, y = train.SalePrice, ax = ax)
plt.show()

Total Rooms above Ground and Living Area

From a previews experience with Boston data set, you probably main expect to much from the total rooms above ground, as its 'RM' feature (the average number of rooms per dwelling), but here is not the same scenario. Our common sense make to think that live area maybe has some correlation to it and probably we can combine this two features to produce a better predictor. Let's see. image

In [8]:
sns.reset_defaults()
sns.set(style="ticks", color_codes=True)

df = train[['SalePrice', 'GrLivArea', 'TotRmsAbvGrd']]
df['GrLivAreaByRms'] = train.GrLivArea/train.TotRmsAbvGrd
df['GrLivArea_x_Rms'] = train.GrLivArea*train.TotRmsAbvGrd
fig = plt.figure(figsize=(20,5))
fig1 = fig.add_subplot(121); sns.regplot(x='GrLivAreaByRms', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.GrLivAreaByRms.corr(df['SalePrice'])))
fig2 = fig.add_subplot(122); sns.regplot(x='GrLivArea_x_Rms', y='SalePrice', data=df); plt.legend(['Outliers'])
plt.text(x=30000, y=100000, s='Correlation with SalePrice: {:1.4f}'.format(df.GrLivArea_x_Rms.corr(df['SalePrice'])))

print('                                                                  Outliers:',(df.GrLivArea_x_Rms>=45000).sum())
df = df.loc[df.GrLivArea_x_Rms<45000]
sns.regplot(x='GrLivArea_x_Rms', y='SalePrice', data=df); 
plt.title('Living Area has beter correlation ({:1.2f}) than It Multply by Rooms!'.format(df.GrLivArea.corr(df.SalePrice)))
plt.text(x=30000, y=50000, s='Correlation withou Outliers: {:1.4f}'.format(df.GrLivArea_x_Rms.corr(df['SalePrice'])))
plt.show()
del df
                                                                  Outliers: 1

As we can see, the interaction between the two features did not present a better correlation than that already seen in the living area, include it improves to 0.74 with the cut of the outliers.

On the other hand, the multiplication not only demonstrated the living area outliers already identified, but it still emphasized another. If the strategy is to drop the TotRmsAbvGrd, we should also exclude this additional outlier.

In [9]:
train = train[train.GrLivArea * train.TotRmsAbvGrd < 45000]
print('Train observations after remove outliers:',train.shape[0])
Train observations after remove outliers: 1457

Garage areas and parking

From the boxplot below, we can note that more than 3 parking cars and more than 900 of area are outliers, since a few number of their observations. Although there is a relationship between them, most likely with a smaller number of parking spaces, there may be more garage area for other purposes, reason why the correlation between them is 0.88 and not 1. image

In [10]:
fig = plt.figure(figsize=(20,5))
fig1 = fig.add_subplot(131); sns.boxplot(train.GarageCars)
fig2 = fig.add_subplot(132); sns.boxplot(train.GarageArea)
fig3 = fig.add_subplot(133); sns.boxplot(train.GarageCars, train.GarageArea)
plt.show()
In [11]:
df = train[['SalePrice', 'GarageArea', 'GarageCars']]
df['GarageAreaByCar'] = train.GarageArea/train.GarageCars
df['GarageArea_x_Car'] = train.GarageArea*train.GarageCars

fig = plt.figure(figsize=(20,5))
fig1 = fig.add_subplot(121); sns.regplot(x='GarageAreaByCar', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.GarageAreaByCar.corr(df['SalePrice'])))

fig2 = fig.add_subplot(122); sns.regplot(x='GarageArea_x_Car', y='SalePrice', data=df); plt.legend(['Outliers'])
plt.text(x=-100, y=750000, s='Correlation with SalePrice: {:6.4f}'.format(df.GarageArea_x_Car.corr(df['SalePrice'])))
print('                                                                 Outliers:',(df.GarageArea_x_Car>=3700).sum())
df = df.loc[df.GarageArea_x_Car<3700]
sns.regplot(x='GarageArea_x_Car', y='SalePrice', data=df); plt.title('Garage Area Multiply By Cars is the best!')
plt.text(x=-100, y=700000, s='Correlation withou Outliers: {:6.4f}'.format(df.GarageArea_x_Car.corr(df['SalePrice'])))
plt.show()
del df
                                                                 Outliers: 4

As can be seen the area by car is little useful, but contrary to common sense the multiplication of the area by the number of vacancies yes is. In the division we lose the magnitude and we have to maintain one or another functionality to recover it. With the multiplication we solve the problem of 1 parking space of 10 square feet against another of 10 with 1 square feet each. We could still improve the correlation by 0.06, already considering the exclusion of only 4 outliers.

The identification of the outliers was facilitated, note that before we would have a greater number of outliers, since the respective of each features alone are not coincident. garage So let's continue with the multiplication strategy, remove the two original metrics that have high correlation with each other, and exclude the 4 outliers from the training base.

In [12]:
train = train[train.GarageArea * train.GarageCars < 3700]
print('Total observatiosn after outliers cut:', train.shape[0])
Total observatiosn after outliers cut: 1453

Total Basement Area Vs 1st Flor Area

In our country it is not common to have Basement, I think we thought it was a little spooky. So I looked a bit more "carefully" at this variable... image I noticed that in Ames has a lot of variation, but the predictive effect is very small, so I decided to study its composition with the first floor.

In [13]:
df = train[['SalePrice', 'TotalBsmtSF', '1stFlrSF']]
df['TotalBsmtSFByBms'] = train.TotalBsmtSF/train['1stFlrSF']
df['TotalBsmtSF_x_Bsm'] = train.TotalBsmtSF*train['1stFlrSF']
fig = plt.figure(figsize=(20,5))
fig1 = fig.add_subplot(121); sns.regplot(x='TotalBsmtSFByBms', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.TotalBsmtSFByBms.corr(df['SalePrice'])))
fig2 = fig.add_subplot(122); sns.regplot(x='TotalBsmtSF_x_Bsm', y='SalePrice', data=df); plt.legend(['Outliers'])
plt.text(x=7e06, y=90000, s='Correlation with SalePrice: {:1.4f}'.format(df.TotalBsmtSF_x_Bsm.corr(df['SalePrice'])))

print('                                                             Outliers:',(df.TotalBsmtSF_x_Bsm>=0.9e07).sum())
df = df.loc[df.TotalBsmtSF_x_Bsm<0.9e07]
sns.regplot(x='TotalBsmtSF_x_Bsm', y='SalePrice', data=df); 
plt.title('The multiplacation is better than total basement correlation ({:1.2f}) after outliers cut!'.format(df.TotalBsmtSF.corr(df.SalePrice)))
plt.text(x=7e06, y=50000, s='Correlation withou Outliers: {:1.4f}'.format(df.TotalBsmtSF_x_Bsm.corr(df['SalePrice'])))
plt.show()
del df
                                                             Outliers: 1

image Similar to what we saw in the garage analysis, we again have a better correlation by multiplying the variables, but now we don't have a significant gain with outliers exclusion. So let's continue with the multiplication strategy and remove only the two original metrics that have high correlation with each other.

Year Built Vs Garage Year Built

Of course when we buy a property the date of its construction makes a lot of difference as it can be a source of great headaches. Depending on the age and conditions there will be need for renovations and very old houses there may be cases where the garage has been built or refit after the house itself. image Well, I'd be more worried about the plumbing, the electricity, ... the garage is only for car and trunk, or is it not? Is that so? it will be?

So, let's see the graphs below, and confirm that this two features are highly correlated, but as expect is not easy to find a good substitute by iteration.

In [14]:
df = train[['SalePrice', 'YearBuilt', 'GarageYrBlt']]
df['YearBuilt_x_Garage'] = train.YearBuilt*train.GarageYrBlt
df['Garage_Newest'] = train.YearBuilt < train.GarageYrBlt

fig = plt.figure(figsize=(20,5))
fig1 = fig.add_subplot(121); sns.scatterplot(y = df.SalePrice, x = df.YearBuilt, hue=df.GarageYrBlt, palette= 'YlOrRd')
fig2 = fig.add_subplot(122); sns.regplot(x='YearBuilt_x_Garage', y='SalePrice', data=df); 
plt.text(x=3700000, y=600000, s='YearBuilt Correlation with SalePrice: {:1.4f}'.format(df.YearBuilt.corr(df['SalePrice'])))
plt.text(x=3700000, y=550000, s='Correlation with SalePrice: {:1.4f}'.format(df.YearBuilt_x_Garage.corr(df['SalePrice'])))
plt.show()

However, by making the year of construction of the garage an indicator of whether it is newer, it becomes easiest to identify a pattern of separation.

And more, note that we have a rising price due to the lower age. Maybe the old cars had the garage would only be for themselves... image ..., or put it in the barn. Today we must have other more usable uses for garage, right...? image

In [15]:
sns.lmplot(y = 'SalePrice', x = 'YearBuilt', data=df, markers='.', 
           aspect=1.4, height=4, hue= 'Garage_Newest', palette= 'YlOrRd')
plt.show();  
del df

But note that although we have a rising price the newer the house, the growth rate is very smooth, even with the rate gain with a newer garage. This makes sense, given that the prices of these regressors are meeting with the mean price of each year.

Bathrooms Features

It's time to take a break and go to the toilet, to our luck there are 4 bathroom variables in our data set. FullBath has the largest correlation with SalePrice between than. The others individually, these features are not very important.

In [16]:
fig = plt.figure(figsize=(20,10))
fig1 = fig.add_subplot(221); sns.regplot(x='FullBath', y='SalePrice', data=train)
plt.title('Correlation with SalePrice: {:6.4f}'.format(train.FullBath.corr(train['SalePrice'])))

fig2 = fig.add_subplot(222); sns.regplot(x='HalfBath', y='SalePrice', data=train);
plt.title('Correlation with SalePrice: {:6.4f}'.format(train.HalfBath.corr(train['SalePrice'])))

fig3 = fig.add_subplot(223); sns.regplot(x='BsmtFullBath', y='SalePrice', data=train)
plt.title('Correlation with SalePrice: {:6.4f}'.format(train.BsmtFullBath.corr(train['SalePrice'])))

fig4 = fig.add_subplot(224); sns.regplot(x='BsmtHalfBath', y='SalePrice', data=train);
plt.title('Correlation with SalePrice: {:6.4f}'.format(train.HalfBath.corr(train['SalePrice'])))

plt.show()

image However, I assume that I if I add them up into one predictor, this predictor is likely to become a strong one. A half-bath, also known as a powder room or guest bath, has only two of the four main bathroom components-typically a toilet and sink. Consequently, I will also count the half bathrooms as half.

In [17]:
df = train[['SalePrice', 'FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath']]
df['TotBathrooms'] = df.FullBath + (df.HalfBath*0.5) + df.BsmtFullBath + (df.BsmtHalfBath*0.5)
In [18]:
fig = plt.figure(figsize=(10,5))
sns.regplot(x='TotBathrooms', y='SalePrice', data=df); plt.legend(['Outliers'])
plt.text(x=1, y=680000, s='Correlation with SalePrice: {:6.4f}'.format(df.TotBathrooms.corr(df['SalePrice'])))
print('                                                                 Outliers:',(df.TotBathrooms>=5).sum())
df = df.loc[df.TotBathrooms<5]
sns.regplot(x='TotBathrooms', y='SalePrice', data=df); plt.title('Cut Total Bathrooms Outliers is the best!')
plt.text(x=1, y=630000, s='Correlation withou Outliers: {:6.4f}'.format(df.TotBathrooms.corr(df['SalePrice'])))
plt.show()
                                                                 Outliers: 2

So, with our best predictor, we can cut only two outliers, use it and substitute all others bath features with a existence indicator.

In [19]:
train = train[(train.FullBath + (train.HalfBath*0.5) + train.BsmtFullBath + (train.BsmtHalfBath*0.5))<5]
print('Data observations after outliers deletion:', train.shape[0])
Data observations after outliers deletion: 1451

Reviwe Porch Features:

The porch is where many people feel more comfortable to watch life go by, or you prefer the sofa in front of the TV, I think there are people that solved this to the family don't can fighting about this... image ... this idea should make a house worth more, should not it?

In [20]:
def PorchPlots():
    fig = plt.figure(figsize=(20,10))
    fig1 = fig.add_subplot(231); sns.regplot(x='OpenPorchSF', y='SalePrice', data=df)
    plt.title('Correlation with SalePrice: {:6.4f}'.format(df.OpenPorchSF.corr(df['SalePrice'])))

    fig2 = fig.add_subplot(232); sns.regplot(x='EnclosedPorch', y='SalePrice', data=df);
    plt.title('Correlation with SalePrice: {:6.4f}'.format(df.EnclosedPorch.corr(df['SalePrice'])))

    fig3 = fig.add_subplot(233); sns.regplot(x='TSsnPorch', y='SalePrice', data=df)
    plt.title('Correlation with SalePrice: {:6.4f}'.format(df.TSsnPorch.corr(df['SalePrice'])))

    fig4 = fig.add_subplot(234); sns.regplot(x='ScreenPorch', y='SalePrice', data=df);
    plt.title('Correlation with SalePrice: {:6.4f}'.format(df.ScreenPorch.corr(df['SalePrice'])))

    fig5 = fig.add_subplot(235); sns.regplot(x='WoodDeckSF', y='SalePrice', data=df);
    plt.title('Correlation with SalePrice: {:6.4f}'.format(df.WoodDeckSF.corr(df['SalePrice'])))

    fig6 = fig.add_subplot(236); sns.regplot(x='TotalPorchSF', y='SalePrice', data=df);
    plt.title('Correlation with SalePrice: {:6.4f}'.format(df.TotalPorchSF.corr(df['SalePrice'])))

    plt.show()

df = train[['SalePrice', 'OpenPorchSF', 'EnclosedPorch', 'TSsnPorch', 'ScreenPorch', 'WoodDeckSF']]
df['TotalPorchSF'] = df.OpenPorchSF + df.EnclosedPorch + df.TSsnPorch + df.ScreenPorch + df.WoodDeckSF
#df = df[df.TotalPorchSF<=600] # A possible outlier cut!
PorchPlots()
In [21]:
df.OpenPorchSF = df.OpenPorchSF > 0
df.EnclosedPorch =  df.EnclosedPorch > 0
df.TSsnPorch = df.TSsnPorch > 0
df.ScreenPorch = df.ScreenPorch > 0
df.WoodDeckSF = df.WoodDeckSF > 0
df.TotalPorchSF = np.sqrt(df.TotalPorchSF) * (np.log1p(np.sqrt(df.TotalPorchSF))**2)

PorchPlots()

As we have seen, porch features have low correlation with price, and by the graphics we see all most has low bas and high variance, being a high risk to end complex models and fall into ouverfit.

Slope of property and Lot area

Everyone knows that the size of the lot matters, but has anyone seen any ad talking about the slope? image

In [22]:
# LandSlope: Slope of property
LandSlope = {}
LandSlope['Gtl'] = 3 #'Gentle slope'
LandSlope['Mod'] = 2 #'Moderate Slope'
LandSlope['Sev'] = 1 #'Severe Slope'

df = train[['SalePrice', 'LandSlope', 'LotArea']]
df.LandSlope = df.LandSlope.map(LandSlope)
df['LotAreaMultSlope'] = (df.LotArea * df.LandSlope)

fig = plt.figure(figsize=(20,10))
fig1 = fig.add_subplot(231); sns.regplot(x='LandSlope', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.LandSlope.corr(df['SalePrice'])))

fig2 = fig.add_subplot(232); sns.regplot(x='LotArea', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.LotArea.corr(df['SalePrice'])))

fig3 = fig.add_subplot(233); sns.regplot(x='LotAreaMultSlope', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.LotAreaMultSlope.corr(df['SalePrice'])))
plt.show()

It is interesting to note that the slope has a low correlation, but as an expected negative. On the other hand, the lot size does not present such a significant correlation, contrary to the interaction between these two characteristics, which is better and also allow us to identify some outliers. Let's take a look at the effect of removing the outliers.

In [23]:
print(df[df.LotArea>155000])
df = df[df.LotArea<155000]
fig = plt.figure(figsize=(20,10))
fig1 = fig.add_subplot(234); sns.regplot(x='LandSlope', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.LandSlope.corr(df['SalePrice'])))

fig2 = fig.add_subplot(235); sns.regplot(x='LotArea', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.LotArea.corr(df['SalePrice'])))

fig3 = fig.add_subplot(236); sns.regplot(x='LotAreaMultSlope', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.LotAreaMultSlope.corr(df['SalePrice'])))
plt.show()
     SalePrice  LandSlope  LotArea  LotAreaMultSlope
249     277000          1   159000            159000
313     375000          1   215245            215245
335     228950          1   164660            164660

Neighborhood

Let's watch how much the neighborhood may be influencing the price. image

In [24]:
figa = plt.figure(figsize=(20, 5))
g = train.Neighborhood.value_counts().plot(kind='bar', title='Number of Sales by Neighborhood')

figb = plt.figure(figsize=(20, 5))
plt.tight_layout()
df = train[['SalePrice', 'YrSold', 'Neighborhood']]

df['TotalArea'] = (train.TotalBsmtSF.fillna(0) + train.WoodDeckSF.fillna(0) + train.GrLivArea.fillna(0) + 
                   train.LotArea.fillna(0) + train.MasVnrArea.fillna(0) + train.GarageArea.fillna(0) + 
                   train.OpenPorchSF.fillna(0) + train.TSsnPorch.fillna(0) + train.ScreenPorch.fillna(0) + 
                   train.EnclosedPorch.fillna(0) + train.PoolArea.fillna(0) )
 
df = df.groupby(by=['Neighborhood', 'YrSold'], as_index=False).sum()
Neig = df[['SalePrice', 'TotalArea', 'Neighborhood']].groupby(by='Neighborhood', as_index=False).sum()
Neig['NeigPrice'] = Neig.SalePrice / Neig.TotalArea
Neig.drop(['TotalArea', 'SalePrice'], axis=1, inplace=True)
g = Neig.groupby('Neighborhood').NeigPrice.sum().sort_values(ascending = False).\
    plot(kind='bar', title='Mean Sales Prices per Area (Constructed + Lot) by Neighborhood')
Neig = Neig.groupby(by='Neighborhood', as_index=True).NeigPrice.sum().sort_values(ascending = False)

As we can see prices are affected by the neighborhood, yes, if more similar more they attract. But we will delve a little and see how the year and month of the sale also has great influence on the price variation and confirm the seasonality. image

In [25]:
# Yearly Sales Price per Area (Constructed + Lot) by Neighborhood:
df['HistPriceByNeighborhood'] = df.SalePrice / df.TotalArea
df.drop(['TotalArea', 'SalePrice'], axis=1, inplace=True)

# Fill the gaps
df = df.append(pd.DataFrame([['NPkVill', 2006, df.HistPriceByNeighborhood[df.Neighborhood=='NPkVill'].mean()]], 
                            columns=df.columns))

df = df.append(pd.DataFrame([['Veenker', 2009, df.HistPriceByNeighborhood[df.Neighborhood=='Veenker'].mean()]], 
                            columns=df.columns))

df = df.append(pd.DataFrame([['Veenker', 2010, df.HistPriceByNeighborhood[df.Neighborhood=='Veenker'].mean()]], 
                            columns=df.columns))

df = df.append(pd.DataFrame([['Blueste', 2006, df.HistPriceByNeighborhood[df.Neighborhood=='Blueste'].min()]], 
                            columns=df.columns))

df = df.append(pd.DataFrame([['Blueste', 2007, df.HistPriceByNeighborhood[df.Neighborhood=='Blueste'].min()]], 
                            columns=df.columns))

df = df.append(pd.DataFrame([['Blueste', 2010, df.HistPriceByNeighborhood[df.Neighborhood=='Blueste'].max()]], 
                            columns=df.columns))

# Reserve data to merge with all data set of train and test data
YearlyPrice = df
YearlyPrice.columns = ['Neighborhood', 'YrSold', 'YearlyPriceByNeighborhood']

print('                              Yearly Sales Prices per Area (Constructed + Lot) by Neighborhood:')
g = sns.catplot(y= 'YearlyPriceByNeighborhood', x = 'YrSold', col='Neighborhood', data=YearlyPrice, 
               kind="point", aspect=.6, col_wrap=7, height=4, col_order=Neig.index)
                              Yearly Sales Prices per Area (Constructed + Lot) by Neighborhood:
In [26]:
# Monthly Sales Prices per Area (Constructed + Lot) by Neighborhood:
df = train[['SalePrice', 'MoSold', 'Neighborhood']]

df['TotalArea'] = (train.TotalBsmtSF.fillna(0) + train.WoodDeckSF.fillna(0) + train.GrLivArea.fillna(0) + 
                   train.LotArea.fillna(0) + train.MasVnrArea.fillna(0) + train.GarageArea.fillna(0) + 
                   train.OpenPorchSF.fillna(0) + train.TSsnPorch.fillna(0) + train.ScreenPorch.fillna(0) + 
                   train.EnclosedPorch.fillna(0) + train.PoolArea.fillna(0) )

df = df.groupby(by=['Neighborhood', 'MoSold'], as_index=False).sum()
df['HistPriceByNeighborhood'] = df.SalePrice / df.TotalArea
df.drop(['TotalArea', 'SalePrice'], axis=1, inplace=True)

print('                                 Monthly Sales Prices per Area (Constructed + Lot) by Neighborhood:')
g = sns.catplot(y= 'HistPriceByNeighborhood', x = 'MoSold', col='Neighborhood', data=df, 
               kind="point", aspect=.6, col_wrap=7, height=4, col_order=Neig.index )
                                 Monthly Sales Prices per Area (Constructed + Lot) by Neighborhood:
In [27]:
# Outliers from Crawfor Neighborhood
df = train[train.Neighborhood=='Crawfor'][['SalePrice', 'MoSold', 'Neighborhood']]

df['TotalArea'] = (train.TotalBsmtSF.fillna(0) + train.WoodDeckSF.fillna(0) + train.GrLivArea.fillna(0) + 
                   train.LotArea.fillna(0) + train.MasVnrArea.fillna(0) + train.GarageArea.fillna(0) + 
                   train.OpenPorchSF.fillna(0) + train.TSsnPorch.fillna(0) + train.ScreenPorch.fillna(0) + 
                   train.EnclosedPorch.fillna(0) + train.PoolArea.fillna(0) )

df['HistPriceByNeighborhood'] = df.SalePrice / df.TotalArea
df[df.HistPriceByNeighborhood>30]
Out[27]:
SalePrice MoSold Neighborhood TotalArea HistPriceByNeighborhood
1181 392500 11 Crawfor 9875.000 39.747
1405 275000 1 Crawfor 8074.000 34.060
In [28]:
train = train.loc[~(train.SalePrice==392500.0)]
train = train.loc[~((train.SalePrice==275000.0) & (train.Neighborhood=='Crawfor'))]
print('Data observations after outliers deletion:', train.shape[0])
Data observations after outliers deletion: 1449
In [29]:
# Bin neighborhood for trade cases with low observations on monthly sales prices per Area (Constructed + Lot) by Neighborhood:
Neigb = {}
Neigb['Blueste'] = 'Top'     # 32.212721
Neigb['Blmngtn'] = 'Top'     # 28.364756
Neigb['BrDale']  = 'BrDale'  # 24.903923
Neigb['NPkVill'] = 'NPkVill' # 23.681105
Neigb['MeadowV'] = 'High'    # 22.034923
Neigb['StoneBr'] = 'High'    # 20.475090
Neigb['NridgHt'] =  'High'   # 20.209245
Neigb['Somerst'] = 'Somerst' # 19.551888
Neigb['NoRidge'] = 'NoRidge' # 17.038145
Neigb['CollgCr'] = 'CollgCr' # 15.134767
Neigb['SawyerW'] = 'SawyerW' # 13.992995
Neigb['Crawfor'] = 'Crawfor' # 13.773418
Neigb['Gilbert'] = 'Gilbert' # 13.260281
Neigb['BrkSide'] = 'BrkSide' # 12.785202
Neigb['SWISU']   = 'SVN'     # 12.635171
Neigb['Veenker'] = 'SVN'     # 12.343735
Neigb['NWAmes']  = 'SVN'     # 12.066590
Neigb['OldTown'] = 'OldTown' # 11.571331
Neigb['NAmes']   = 'NAmes'   # 11.091393
Neigb['Mitchel'] = 'Mitchel' # 10.936368
Neigb['Edwards'] = 'Edwards' # 10.614919
Neigb['Sawyer']  = 'Sawyer'  #10.334445
Neigb['IDOTRR']  = 'IDOTRR'  # 9.880838
Neigb['Timber']  = 'Timber'  # 8.723326
Neigb['ClearCr'] = 'ClearCr' # 6.113654

# Preper dataset for Sales Price per Area (Constructed + Lot) by Neighborhood:
df = train[['SalePrice', 'MoSold', 'Neighborhood']]

df['TotalArea'] = (train.TotalBsmtSF.fillna(0) + train.WoodDeckSF.fillna(0) + train.GrLivArea.fillna(0) + 
                   train.LotArea.fillna(0) + train.MasVnrArea.fillna(0) + train.GarageArea.fillna(0) + 
                   train.OpenPorchSF.fillna(0) + train.TSsnPorch.fillna(0) + train.ScreenPorch.fillna(0) + 
                   train.EnclosedPorch.fillna(0) + train.PoolArea.fillna(0) )
df['Price'] = df.SalePrice/df.TotalArea

# Cut Outliers from Crawfor Neighborhood
df = df[(((df.Neighborhood == 'Crawfor') & (df.Price<30.)) | (df.Neighborhood != 'Crawfor'))]
df.drop(['Price'], axis=1, inplace=True)

df.Neighborhood = train.Neighborhood.map(Neigb)

df = df.groupby(by=['Neighborhood', 'MoSold'], as_index=False).sum()
df['HistPriceByNeighborhood'] = df.SalePrice / df.TotalArea

# Get the index for order by value
Neig = df[['SalePrice', 'TotalArea', 'Neighborhood']].groupby(by='Neighborhood', as_index=False).sum()
Neig['NeigPrice'] = Neig.SalePrice / Neig.TotalArea
Neig.drop(['TotalArea', 'SalePrice'], axis=1, inplace=True)
Neig = Neig.groupby(by='Neighborhood', as_index=True).NeigPrice.sum().sort_values(ascending = False)

df.drop(['TotalArea', 'SalePrice'], axis=1, inplace=True)

# Fill the gaps
df = df.append(pd.DataFrame([['Top', 1, df.HistPriceByNeighborhood[df.Neighborhood=='Top'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['Top', 8, df.HistPriceByNeighborhood[df.Neighborhood=='Top'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['Top', 11, df.HistPriceByNeighborhood[df.Neighborhood=='Top'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['Top', 12, df.HistPriceByNeighborhood[df.Neighborhood=='Top'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['BrDale', 1, df.HistPriceByNeighborhood[df.Neighborhood=='BrDale'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['BrDale', 10, df.HistPriceByNeighborhood[df.Neighborhood=='BrDale'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['BrDale', 12, df.HistPriceByNeighborhood[df.Neighborhood=='BrDale'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NPkVill', 1, df.HistPriceByNeighborhood[df.Neighborhood=='NPkVill'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NPkVill', 3, df.HistPriceByNeighborhood[df.Neighborhood=='NPkVill'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NPkVill', 5, df.HistPriceByNeighborhood[df.Neighborhood=='NPkVill'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NPkVill', 8, df.HistPriceByNeighborhood[df.Neighborhood=='NPkVill'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NPkVill', 9, df.HistPriceByNeighborhood[df.Neighborhood=='NPkVill'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NPkVill', 11, df.HistPriceByNeighborhood[df.Neighborhood=='NPkVill'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NPkVill', 12, df.HistPriceByNeighborhood[df.Neighborhood=='NPkVill'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NoRidge', 2, df.HistPriceByNeighborhood[df.Neighborhood=='NoRidge'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NoRidge', 9, df.HistPriceByNeighborhood[df.Neighborhood=='NoRidge'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['NoRidge', 10, df.HistPriceByNeighborhood[df.Neighborhood=='NoRidge'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['Crawfor', 1, df.HistPriceByNeighborhood[df.Neighborhood=='Crawfor'].max()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['Crawfor', 4, df.HistPriceByNeighborhood[df.Neighborhood=='Crawfor'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['Timber', 10, df.HistPriceByNeighborhood[df.Neighborhood=='Timber'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['ClearCr', 1, df.HistPriceByNeighborhood[df.Neighborhood=='ClearCr'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['ClearCr', 2, df.HistPriceByNeighborhood[df.Neighborhood=='ClearCr'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['ClearCr', 8, df.HistPriceByNeighborhood[df.Neighborhood=='ClearCr'].mean()]], 
                            columns=df.columns))
df = df.append(pd.DataFrame([['Edwards', 12, df.HistPriceByNeighborhood[df.Neighborhood=='Edwards'].mean()]], 
                            columns=df.columns))


# Reserve data to merge with all data set of train and test data
MonthlyPrice = df
MonthlyPrice.columns = ['Neigb', 'MoSold', 'MonthlyPriceByNeighborhood']

print('                         Monthly Hist Sales Prices per Area (Construct + Lot) by Neighborhood:')
g = sns.catplot(y= 'MonthlyPriceByNeighborhood', x = 'MoSold', col='Neigb', data=MonthlyPrice, 
               kind="point", aspect=.6, col_wrap=7, height=4, col_order=Neig.index )
                         Monthly Hist Sales Prices per Area (Construct + Lot) by Neighborhood:

image As we expected, the seasonality does have some effect, but of course we draw this conclusion based only on the above graphs is precipitated if not erroneous, given that even having restricted the views still exist houses with different characteristics in the same neighborhood.

However, this is sufficient to understand that the timing of the sale matters, so the model will probably have to take this into account, or this will be part of the residual errors.

Check the Dependent Variable - SalePrice:

Since most of the machine learning algorithms start from the principle that our data has a normal distribution, we first take a look at the distribution of our dependent variable. For this, I create a procedure to plot the Sales Distribution and QQ-plot to identify substantive departures from normality, likes outliers, skewness and kurtosis. image

In [30]:
def QQ_plot(data, measure):
    fig = plt.figure(figsize=(20,7))

    #Get the fitted parameters used by the function
    (mu, sigma) = norm.fit(data)

    #Kernel Density plot
    fig1 = fig.add_subplot(121)
    sns.distplot(data, fit=norm)
    fig1.set_title(measure + ' Distribution ( mu = {:.2f} and sigma = {:.2f} )'.format(mu, sigma), loc='center')
    fig1.set_xlabel(measure)
    fig1.set_ylabel('Frequency')

    #QQ plot
    fig2 = fig.add_subplot(122)
    res = probplot(data, plot=fig2)
    fig2.set_title(measure + ' Probability Plot (skewness: {:.6f} and kurtosis: {:.6f} )'.format(data.skew(), data.kurt()), loc='center')

    plt.tight_layout()
    plt.show()
In [31]:
QQ_plot(train.SalePrice, 'Sales Price')
In [32]:
#We use the numpy fuction log1p which applies log(1+x) to all elements of the column
train.SalePrice = np.log1p(train.SalePrice)

QQ_plot(train.SalePrice, 'Log1P of Sales Price')

From the first graph above we can see that Sales Price distribution is skewed, has a peak, it deviates from normal distribution and is positively biased. From the Probability Plot, we could see that Sales Price also does not align with the diagonal red line which represent normal distribution. The form of its distribution confirm that is a skewed right.

With skewness positive of 1.9, we confirm the lack of symmetry and indicate that Sales Price are skewed right, as we can see too at the Sales Distribution plot, skewed right means that the right tail is long relative to the left tail. The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers, and positive kurtosis indicates a heavy-tailed distribution and negative kurtosis indicates a light tailed distribution. So, with 6.5 of positive kurtosis Sales Price are definitely heavy-tailed and has some outliers that we need take care.

Note that in contrast to common belief, training a linear regression model does not require that the explanatory or target variables are normally distributed. The normality assumption is only a requirement for certain statistical tests and hypothesis tests.

So, I try some linear regressors with both, with and without transformation of SalePrice to check their results.

Test hypothesis of better feature: Construction Area

Let's call a specialist to help us create a new feature that sum all area features, the construct area, and evaluates if is better than their parcels. image

In [33]:
df = train[['SalePrice', 'GrLivArea']]
df['ConstructArea'] = (train.TotalBsmtSF.fillna(0) + train.WoodDeckSF.fillna(0) + train.GrLivArea.fillna(0) + 
                       train.MasVnrArea.fillna(0) + train.GarageArea.fillna(0) + train.OpenPorchSF.fillna(0) + 
                       train.TSsnPorch.fillna(0) + train.ScreenPorch.fillna(0) + train.EnclosedPorch.fillna(0) + 
                       train.PoolArea.fillna(0) )
                         
fig8 = plt.figure(figsize=(20,5))
fig9 = fig8.add_subplot(121); sns.regplot((df.ConstructArea), df.SalePrice)
plt.title('Cosntruct Area correlation {:1.2f}'.format(df.ConstructArea.corr(df.SalePrice)))

fig10 = fig8.add_subplot(122); sns.regplot((df.GrLivArea.fillna(0)), df.SalePrice)
tit = 'Livig Area correlation is {:1.2f} and is {:1.2f} correlated to Construct Area'
plt.title(tit.format(df.GrLivArea.fillna(0).corr(df.SalePrice), df.GrLivArea.corr(df.ConstructArea)))
plt.show()

As we can see, our built metric performs better than its parcels, even more than the living area. Besides better correlation, it presents less bias and variance.

This may lead us to think of a model option that uses only the constructed area, without including any of the parcels, that would be replaced by an indication variable of existence or not if there is no categorical variable associated with it.

We can also use them to compose other variables and finally remove them.

Anyway the living area seems useless now, to prove it let's go see how a single linear regressor perform with this options:

In [34]:
def print_results():
    # The coefficients
    print('Coefficients: \n', lr.coef_)
    # The mean squared error
    print("Root mean squared error: %.4f"
          % np.expm1(np.sqrt(mean_squared_error(y_test, y_pred))))
    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.4f' % r2_score(y_test, y_pred))
    print('--------------------------------------------------------------------------------\n')
    
scale = RobustScaler()
y = df.SalePrice

X = scale.fit_transform(df[['ConstructArea', 'GrLivArea']])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

lr = LinearRegression()

print('1. Linear regressor with only living Area:')
lr.fit(X_train[: , 1].reshape(-1, 1), y_train)

# Make predictions using the testing set
y_pred = lr.predict(X_test[: , 1].reshape(-1, 1))
print_results()

print('2. Linear regressor with bouth features:')
lr.fit(X_train, y_train)

# Makepredictions using the testing set
y_pred = lr.predict(X_test)
print_results()

print('3. Linear regressor with only Construct Area:')
lr = LinearRegression()
lr.fit(X_train[: , 0].reshape(-1, 1), y_train)

# Makepredictions using the testing set
y_pred = lr.predict(X_test[: , 0].reshape(-1, 1))
print_results()

print('4. Polinomial regressor of orden 3 with only Construction Area:')
# create polynomial features
cubic = PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)
X_cubic = cubic.fit_transform(X_train[: , 0].reshape(-1, 1))

# cubic fit
lr = lr.fit(X_cubic, y_train)
y_pred = lr.predict(cubic.fit_transform(X_test[: , 0].reshape(-1, 1)))
print_results()

print('5. Polinomial regressor of orden 3 with both features:')
# create polynomial features
cubic = PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)
X_cubic = cubic.fit_transform(X_train)

# cubic fit
lr = lr.fit(X_cubic, y_train)
y_pred = lr.predict(cubic.fit_transform(X_test))
print_results()
1. Linear regressor with only living Area:
Coefficients: 
 [0.37373448]
Root mean squared error: 0.3135
Variance score: 0.5547
--------------------------------------------------------------------------------

2. Linear regressor with bouth features:
Coefficients: 
 [0.40446921 0.02013973]
Root mean squared error: 0.2275
Variance score: 0.7484
--------------------------------------------------------------------------------

3. Linear regressor with only Construct Area:
Coefficients: 
 [0.42050449]
Root mean squared error: 0.2276
Variance score: 0.7481
--------------------------------------------------------------------------------

4. Polinomial regressor of orden 3 with only Construction Area:
Coefficients: 
 [ 0.437434   -0.0420375   0.00630214]
Root mean squared error: 0.2212
Variance score: 0.7608
--------------------------------------------------------------------------------

5. Polinomial regressor of orden 3 with both features:
Coefficients: 
 [ 0.41182783  0.03846449 -0.05035786  0.06608028 -0.07134685 -0.01285254
  0.08643164 -0.11203676  0.04362431]
Root mean squared error: 0.2225
Variance score: 0.7584
--------------------------------------------------------------------------------

According to our specialist, the above results show to us that:

  • we can safely eliminate the living area, and as there are no records with it zero or null we will not create an existence indicator for it.
  • We may be able to discard other area metrics, especially those that have many zeros for nulls, which contribute little to accuracy and even to reduce multicollinearity.
  • the polynomial transformation of 3th degree presents improvements of 2.6% and 1.8% of RMSE and R2 respectively. So, now we have a simple regressor from our specialist to bit!

Before create new features and other test, is better to make the data cleaning and fill nulls.

3. Check Data Quality:

image

Nulls Check:

In this section, I am going to fix the 34 predictors that contains missing values. I will go through them working my way down from most NAs until I have fixed them all. If I stumble upon a variable that actually forms a group with other variables, I will also deal with them as a group. For instance, there are multiple variables that relate to Pool, Garage, and Basement.

In [35]:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)

print("All data observations have {0} rows and {1} columns".format(all_data.shape[0], all_data.shape[1]))
details = rstr(all_data)
print("All data have {1:2.2%} of null at {0} features".format(details[details.nulls>0].shape[0], 
                                                   details.nulls[details.nulls>0].sum()/all_data.size))
print('\nBelow the table with all columns with nulls oredered by missin ration:')
display(details.loc[(details.nulls>0), 'types':'uniques'].sort_values(by= 'missing ration', ascending=False))
All data observations have 2908 rows and 80 columns
Data shape: (2908, 80)
___________________________
Data types:
 object     43
int64      25
float64    12
Name: types, dtype: int64
___________________________
All data have 5.98% of null at 34 features

Below the table with all columns with nulls oredered by missin ration:
types counts distincts nulls missing ration uniques
PoolQC object 9 4 2899 99.691 [[nan, Ex, Fa, Gd]]
MiscFeature object 104 5 2804 96.424 [[nan, Shed, Gar2, Othr, TenC]]
Alley object 197 3 2711 93.226 [[nan, Grvl, Pave]]
Fence object 570 5 2338 80.399 [[nan, MnPrv, GdWo, GdPrv, MnWw]]
FireplaceQu object 1493 6 1415 48.659 [[nan, TA, Gd, Fa, Ex, Po]]
LotFrontage float64 2423 129 485 16.678 [[65.0, 80.0, 68.0, 60.0, 84.0, 85.0, 75.0, na...
GarageCond object 2752 6 156 5.365 [[TA, Fa, nan, Gd, Po, Ex]]
GarageQual object 2752 6 156 5.365 [[TA, Fa, Gd, nan, Ex, Po]]
GarageYrBlt float64 2752 104 156 5.365 [[2003.0, 1976.0, 2001.0, 1998.0, 2000.0, 1993...
GarageFinish object 2752 4 156 5.365 [[RFn, Unf, Fin, nan]]
GarageType object 2754 7 154 5.296 [[Attchd, Detchd, BuiltIn, CarPort, nan, Basme...
BsmtExposure object 2826 5 82 2.820 [[No, Gd, Mn, Av, nan]]
BsmtCond object 2826 5 82 2.820 [[TA, Gd, nan, Fa, Po]]
BsmtQual object 2827 5 81 2.785 [[Gd, TA, Ex, nan, Fa]]
BsmtFinType2 object 2828 7 80 2.751 [[Unf, BLQ, nan, ALQ, Rec, LwQ, GLQ]]
BsmtFinType1 object 2829 7 79 2.717 [[GLQ, ALQ, Unf, Rec, BLQ, nan, LwQ]]
MasVnrType object 2884 5 24 0.825 [[BrkFace, None, Stone, BrkCmn, nan]]
MasVnrArea float64 2885 443 23 0.791 [[196.0, 0.0, 162.0, 350.0, 186.0, 240.0, 286....
MSZoning object 2904 6 4 0.138 [[RL, RM, C (all), FV, RH, nan]]
Utilities object 2906 3 2 0.069 [[AllPub, NoSeWa, nan]]
Functional object 2906 8 2 0.069 [[Typ, Min1, Maj1, Min2, Mod, Maj2, Sev, nan]]
BsmtHalfBath float64 2906 4 2 0.069 [[0.0, 1.0, 2.0, nan]]
BsmtFullBath float64 2906 5 2 0.069 [[1.0, 0.0, 2.0, 3.0, nan]]
GarageCars float64 2907 7 1 0.034 [[2.0, 3.0, 1.0, 0.0, 4.0, 5.0, nan]]
Exterior2nd object 2907 17 1 0.034 [[VinylSd, MetalSd, Wd Shng, HdBoard, Plywood,...
Exterior1st object 2907 16 1 0.034 [[VinylSd, MetalSd, Wd Sdng, HdBoard, BrkFace,...
KitchenQual object 2907 5 1 0.034 [[Gd, TA, Ex, Fa, nan]]
Electrical object 2907 6 1 0.034 [[SBrkr, FuseF, FuseA, FuseP, Mix, nan]]
BsmtUnfSF float64 2907 1134 1 0.034 [[150.0, 284.0, 434.0, 540.0, 490.0, 64.0, 317...
BsmtFinSF2 float64 2907 273 1 0.034 [[0.0, 32.0, 668.0, 486.0, 93.0, 491.0, 506.0,...
BsmtFinSF1 float64 2907 989 1 0.034 [[706.0, 978.0, 486.0, 216.0, 655.0, 732.0, 13...
SaleType object 2907 10 1 0.034 [[WD, New, COD, ConLD, ConLI, CWD, ConLw, Con,...
TotalBsmtSF float64 2907 1055 1 0.034 [[856.0, 1262.0, 920.0, 756.0, 1145.0, 796.0, ...
GarageArea float64 2907 600 1 0.034 [[548.0, 460.0, 608.0, 642.0, 836.0, 480.0, 63...
In [36]:
class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """
        Impute missing values:
        - Columns of dtype object are imputed with the most frequent value in column.
        - Columns of other types are imputed with mean of column.
        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

Some Observations Respect Data Quality:

The total training observations are 1,460 and have 79 features ( 3 float64, 33 int64, 43 object ) with 19 columns with nulls.

All data observations, including test data set, have 2919 rows and 79 features, where 6.06% are nulls, but found nulls at 34 different features. So the test dataset has null in features that training dataset doesn't have!

Based on feature description provide, A feature that has NA means it is absent

First, before we assume this as a total reality, we need check some quality issues, like the record has Garage, but doesn't have Garage Quality and vice versa.

  • 14 has few null, so are good candidates for imputer strategies:
    • GarageFinish 1379 object. Interior finish of the garage.
    • GarageQual 1379 object: Garage quality.
    • GarageCond 1379 object: Garage condition.
    • GarageType 1379 object: Garage location
    • GarageYrBlt 1379 float64: Year garage was built
    • Electrical 1459 object. Only one, can apply the most common.
    • MasVnrType 1452 object: is the masonry veneer type, hasn't CBlock!
    • MasVnrArea 1452 float64: Masonry veneer area in square feet.
    • BsmtExposure 1422 object: Refers to walkout or garden level walls.
    • BsmtFinType2 1422 object: Rating of basement finished area (if multiple types).
    • BsmtQual 1423 object: Evaluates the height of the basement Doesn't have PO.
    • BsmtCond 1423 object: Evaluates the general condition of the basement.
    • BsmtFinType1 1423 object: Rating of basement finished area (if multiple types).
    • LotFrontage 1201 float64: is the linear feet of street connected to property.

  • 5 has miss ration grater than 47%, maybe candidates to exclude, especially if their have below correlation with price.
    • Fence 281 object: Fence quality.
    • FireplaceQu 770 object: Fireplace quality.
    • MiscFeature 54 object: Miscellaneous feature not covered in other categories.
    • Alley 91 object: is the type of alley access to property.
    • PoolQC 7 object: Pool quality. Attention for the related other feature PoolArea: Pool area in square feet

Some numeric data are ordinal or categorical already translate to codes. We need correct identify the ordinal from the description and can maintain as is, but need to change categorical.

Utilities : For this categorical feature all records are "AllPub", except for one "NoSeWa" and 2 NA . Since the house with 'NoSewa' is in the training set, this feature won't help in predictive modeling. We can then safely remove it.

In [37]:
all_data.drop('Utilities', axis=1, inplace=True)

Identify the Most Common Electrical:

image

In [38]:
display(all_data.Electrical.value_counts())

all_data.Electrical = all_data.Electrical.fillna('SBrkr')
SBrkr    2661
FuseA     187
FuseF      50
FuseP       8
Mix         1
Name: Electrical, dtype: int64

Fill Missing Values of Garage Features:

Identify if has some special cases where we find some garage feature inputted, where's others garages features are null. image

In [39]:
feat = ['GarageYrBlt', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'GarageArea', 'GarageCars']
print(all_data[feat].isnull().sum())
print("GarageArea equal a 0: ", (all_data.GarageArea==0).sum())
print("GarageCars equal a 0: ", (all_data.GarageCars==0).sum())
c = all_data[~all_data.GarageType.isnull()][feat]
c[c.GarageYrBlt.isnull()]
GarageYrBlt     156
GarageType      154
GarageFinish    156
GarageQual      156
GarageCond      156
GarageArea        1
GarageCars        1
dtype: int64
GarageArea equal a 0:  154
GarageCars equal a 0:  154
Out[39]:
GarageYrBlt GarageType GarageFinish GarageQual GarageCond GarageArea GarageCars
2115 nan Detchd NaN NaN NaN 360.000 1.000
2565 nan Detchd NaN NaN NaN nan nan

Group by GarageType

Fill missing value with median or mode where GarageType equal a Detchd and 0 or NA for the others.

In [40]:
all_data.GarageType = all_data.GarageType.fillna('NA')

# Group by GarageType and fill missing value with median where GarageType=='Detchd' and 0 for the others
cmedian = all_data[all_data.GarageType=='Detchd'].GarageArea.median()
print("GarageArea median of Type Detchd:", cmedian)
all_data.loc[all_data.GarageType=='Detchd', 'GarageArea'] = all_data.loc[all_data.GarageType=='Detchd', 
                                                                         'GarageArea'].fillna(cmedian)
all_data.GarageArea = all_data.GarageArea.fillna(0)

cmedian = all_data[all_data.GarageType=='Detchd'].GarageCars.median()
print("GarageCars median of Type Detchd:", cmedian)
all_data.loc[all_data.GarageType=='Detchd', 'GarageCars'] = all_data.loc[all_data.GarageType=='Detchd', 
                                                                         'GarageCars'].fillna(cmedian)
all_data.GarageCars = all_data.GarageCars.fillna(0)

cmedian = all_data[all_data.GarageType=='Detchd'].GarageYrBlt.median()
print("GarageYrBlt median of Type Detchd:", cmedian)
all_data.loc[all_data.GarageType=='Detchd', 'GarageYrBlt'] = all_data.loc[all_data.GarageType=='Detchd', 
                                                                          'GarageYrBlt'].fillna(cmedian)
all_data.GarageYrBlt = all_data.GarageYrBlt.fillna(0)

# Group by GarageType and fill missing value with mode where GarageType=='Detchd' and 'NA' for the others
cmode = all_data[all_data.GarageType=='Detchd'].GarageFinish.mode()[0]
print("GarageFinish mode of Type Detchd:", cmode)
all_data.loc[all_data.GarageType=='Detchd', 'GarageFinish'] = all_data.loc[all_data.GarageType=='Detchd', 
                                                                           'GarageFinish'].fillna(cmode)
all_data.GarageFinish = all_data.GarageFinish.fillna('NA')

cmode = all_data[all_data.GarageType=='Detchd'].GarageQual.mode()[0]
print("GarageQual mode of Type Detchd: %s" %cmode)
all_data.loc[all_data.GarageType=='Detchd', 'GarageQual'] = all_data.loc[all_data.GarageType=='Detchd', 
                                                                         'GarageQual'].fillna(cmode)
all_data.GarageQual = all_data.GarageQual.fillna('NA')

cmode = all_data[all_data.GarageType=='Detchd'].GarageCond.mode()[0]
print("GarageCond mode of Type Detchd:", cmode)
all_data.loc[all_data.GarageType=='Detchd', 'GarageCond'] = all_data.loc[all_data.GarageType=='Detchd', 
                                                                         'GarageCond'].fillna(cmode)
all_data.GarageCond = all_data.GarageCond.fillna('NA')
GarageArea median of Type Detchd: 397.5
GarageCars median of Type Detchd: 2.0
GarageYrBlt median of Type Detchd: 1962.0
GarageFinish mode of Type Detchd: Unf
GarageQual mode of Type Detchd: TA
GarageCond mode of Type Detchd: TA

Check if all nulls of Garage features are inputed

image

In [41]:
print(all_data[feat].isnull().sum())
GarageYrBlt     0
GarageType      0
GarageFinish    0
GarageQual      0
GarageCond      0
GarageArea      0
GarageCars      0
dtype: int64

Masonry veneer

Check Nulls and Data Quality Problems: image

In [42]:
feat = ['MasVnrArea', 'MasVnrType']
c = all_data[~all_data.MasVnrArea.isnull()][feat]
print('Masonry veneer Nulls:')
print(all_data[feat].isnull().sum(), '\n')
print("Has MasVnrType but not has MasVnrArea:",all_data[~all_data.MasVnrType.isnull()].MasVnrArea.isnull().sum())
print("Has MasVnrArea but not has MasVnrType:",c[c.MasVnrType.isnull()].MasVnrArea.count())
print(c[c.MasVnrType.isnull()], '\n')

print("Has MasVnrType but MasVnrArea is equal a Zero:",c[c.MasVnrArea==0].MasVnrType.count())
print("MasVnrArea equal a 0: ", (all_data.MasVnrArea==0).sum(), '\n')
print("Has Type and Area == 0:")
print(c[c.MasVnrArea==0].MasVnrType.value_counts(), '\n')

print("Type None with Area > 0 ?")
print(all_data.loc[(all_data.MasVnrType=='None') & (all_data.MasVnrArea>0), ['MasVnrType','MasVnrArea']])

print('\n What is the most comumn MasVnrType after None?')
print(all_data.MasVnrType.value_counts())
Masonry veneer Nulls:
MasVnrArea    23
MasVnrType    24
dtype: int64 

Has MasVnrType but not has MasVnrArea: 0
Has MasVnrArea but not has MasVnrType: 1
      MasVnrArea MasVnrType
2599     198.000        NaN 

Has MasVnrType but MasVnrArea is equal a Zero: 1734
MasVnrArea equal a 0:  1734 

Has Type and Area == 0:
None       1731
BrkFace       2
Stone         1
Name: MasVnrType, dtype: int64 

Type None with Area > 0 ?
     MasVnrType  MasVnrArea
622        None     288.000
769        None       1.000
1222       None       1.000
1291       None     344.000
1325       None     312.000
1658       None     285.000
2441       None       1.000

 What is the most comumn MasVnrType after None?
None       1738
BrkFace     877
Stone       244
BrkCmn       25
Name: MasVnrType, dtype: int64

Correct masonry veneer types

Change to BrkFace the masonry veneer types Nulls and Nones wheres records has masonry Veneer Area

In [43]:
# All None Types with Are greater than 0 update to BrkFace type
all_data.loc[(all_data.MasVnrType=='None') & (all_data.MasVnrArea>0), ['MasVnrType']] = 'BrkFace'

# All Types null with Are greater than 0 update to BrkFace type
all_data.loc[(all_data.MasVnrType.isnull()) & (all_data.MasVnrArea>0), ['MasVnrType']] = 'BrkFace'

# All Types different from None with Are equal to 0 update to median Area of no None types with Areas
all_data.loc[(all_data.MasVnrType!='None') & 
             (all_data.MasVnrArea==0), ['MasVnrArea']] = all_data.loc[(all_data.MasVnrType!='None') & 
                                                                      (all_data.MasVnrArea>0), ['MasVnrArea']].median()[0]
# Filling 0 and None for records wheres both are nulls
all_data.MasVnrArea = all_data.MasVnrArea.fillna(0)
all_data.MasVnrType = all_data.MasVnrType.fillna('None')

Check if all nulls of masonry veneer types are updated

In [44]:
c = all_data[~all_data.MasVnrType.isnull()][feat]
print('Masonry veneer Nulls:')
print(all_data[feat].isnull().sum(), '\n')
Masonry veneer Nulls:
MasVnrArea    0
MasVnrType    0
dtype: int64 

Check and Input Basement Features Nulls:

image

In [45]:
feat = ['BsmtFinSF1','BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF','BsmtFullBath', 'BsmtHalfBath', 
        'BsmtQual', 'BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']
print(all_data[feat].isnull().sum())
print("BsmtFinSF1 equal a 0: ", (all_data.BsmtFinSF1==0).sum())
print("BsmtFinSF2 equal a 0: ", (all_data.BsmtFinSF2==0).sum())
print("BsmtUnfSF equal a 0: ", (all_data.BsmtUnfSF==0).sum())
print("TotalBsmtSF equal a 0: ", (all_data.TotalBsmtSF==0).sum())
print("BsmtFullBath equal a 0: ", (all_data.BsmtFullBath==0).sum())
print("BsmtHalfBath equal a 0: ", (all_data.BsmtHalfBath==0).sum())
BsmtFinSF1       1
BsmtFinSF2       1
BsmtUnfSF        1
TotalBsmtSF      1
BsmtFullBath     2
BsmtHalfBath     2
BsmtQual        81
BsmtCond        82
BsmtExposure    82
BsmtFinType1    79
BsmtFinType2    80
dtype: int64
BsmtFinSF1 equal a 0:  927
BsmtFinSF2 equal a 0:  2560
BsmtUnfSF equal a 0:  240
TotalBsmtSF equal a 0:  78
BsmtFullBath equal a 0:  1702
BsmtHalfBath equal a 0:  2734
In [46]:
# No Basement Av is the most comumn BsmtExposure. 
display(all_data.BsmtExposure.value_counts())

# Update nulls Exposure to Av wheres TotalBsmntSF is grenter tham zero 
all_data.loc[(~all_data.TotalBsmtSF.isnull()) & 
             (all_data.BsmtExposure.isnull()) & 
             (all_data.TotalBsmtSF>0), 'BsmtExposure'] = 'Av'
No    1899
Av     417
Gd     271
Mn     239
Name: BsmtExposure, dtype: int64
In [47]:
# TA is the most comumn BsmtQual. 
display(all_data.BsmtQual.value_counts())

# We use TA for all cases wheres has same evidenci that the house has Basement
all_data.loc[(~all_data.TotalBsmtSF.isnull()) & 
             (all_data.BsmtQual.isnull()) & 
             (all_data.TotalBsmtSF>0), 'BsmtQual'] = 'TA'
TA    1278
Gd    1208
Ex     253
Fa      88
Name: BsmtQual, dtype: int64
In [48]:
# TA is the most comumn BsmtCond. 
display(all_data.BsmtCond.value_counts())

# We use TA for all cases wheres has same evidenci that the house has Basement
all_data.loc[(~all_data.TotalBsmtSF.isnull()) & (all_data.BsmtCond.isnull()) & (all_data.TotalBsmtSF>0), 'BsmtCond'] = 'TA'
TA    2597
Gd     121
Fa     103
Po       5
Name: BsmtCond, dtype: int64
In [49]:
# Unf is the most comumn BsmtFinType2. 
display(all_data.BsmtFinType2.value_counts())

# We use Unf for all cases wheres BsmtFinType2 is null but BsmtFinSF2 is grater than Zro
all_data.loc[(all_data.BsmtFinSF2>0) & (all_data.BsmtFinType2.isnull()) , 'BsmtFinType2'] = 'Unf'
Unf    2482
Rec     105
LwQ      87
BLQ      68
ALQ      52
GLQ      34
Name: BsmtFinType2, dtype: int64
In [50]:
# See below that we have one case where BsmtFinType2 is BLQ and the Area is Zero, but its area was inputed at Unfinesh
display(all_data[(all_data.BsmtFinSF2==0) & (all_data.BsmtFinType2!='Unf') & (~all_data.BsmtFinType2.isnull())][feat])

# Correct BsmtFinSF2 and BsmtUnfSF:
all_data.loc[(all_data.BsmtFinSF2==0) & (all_data.BsmtFinType2!='Unf') & (~all_data.BsmtFinType2.isnull()), 'BsmtFinSF2'] = 354.0
all_data.loc[(all_data.BsmtFinSF2==0) & (all_data.BsmtFinType2!='Unf') & (~all_data.BsmtFinType2.isnull()), 'BsmtUnfSF'] = 0.0
BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
1459 1051.000 0.000 354.000 1405.000 1.000 0.000 Gd TA No GLQ BLQ
In [51]:
# All these cases are clear don´t have basement. 
print("Rest cases where Cond is Null", (all_data[all_data.BsmtCond.isnull()]).shape[0], '\n')
print('Others categories basement features are Null when Cond is Null:\n',
      (all_data[all_data.BsmtCond.isnull()][['BsmtQual', 'BsmtCond', 'BsmtExposure',
                                             'BsmtFinType1' , 'BsmtFinType2']]).isnull().sum())
print('\nOthers numerics basement features are Null or Zero when Cond is Null:\n',
      all_data[all_data.BsmtCond.isnull()][['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF' ,'TotalBsmtSF',
                                            'BsmtFullBath', 'BsmtHalfBath']].sum())
print("\nThe particular cases where's numeric basement features see below are Null were included in the previous groups:") 
display(all_data[all_data.BsmtFullBath.isnull()][feat])

# So, we update these Zero or NA according to their dictionary:
nulls_cols = {'BsmtExposure': 'NA', 'BsmtFinType2': 'NA', 'BsmtQual': 'NA', 'BsmtCond': 'NA', 'BsmtFinType1': 'NA',
              'BsmtFinSF1': 0, 'BsmtFinSF2': 0, 'BsmtUnfSF': 0 ,'TotalBsmtSF': 0, 'BsmtFullBath': 0, 'BsmtHalfBath': 0}

all_data = all_data.fillna(value=nulls_cols)

print('\nFinal Check if all nulls basement features are treated:', all_data[feat].isnull().sum().sum())
Rest cases where Cond is Null 79 

Others categories basement features are Null when Cond is Null:
 BsmtQual        79
BsmtCond        79
BsmtExposure    79
BsmtFinType1    79
BsmtFinType2    79
dtype: int64

Others numerics basement features are Null or Zero when Cond is Null:
 BsmtFinSF1     0.000
BsmtFinSF2     0.000
BsmtUnfSF      0.000
TotalBsmtSF    0.000
BsmtFullBath   0.000
BsmtHalfBath   0.000
dtype: float64

The particular cases where's numeric basement features see below are Null were included in the previous groups:
BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
2109 nan nan nan nan nan nan NaN NaN NaN NaN NaN
2177 0.000 0.000 0.000 0.000 nan nan NaN NaN NaN NaN NaN
Final Check if all nulls basement features are treated: 0

Lot Frontage - Check and Fill Nulls

image

In [52]:
# Group by Neigborhood and fill missing value with Lot frontage median of the respect Neigborhood
NegMean = all_data.groupby('Neighborhood').LotFrontage.mean()

all_data.loc.LotFrontage = all_data[['Neighborhood', 'LotFrontage']].\
                           apply(lambda x: NegMean[x.Neighborhood] if np.isnan(x.LotFrontage) else x.LotFrontage, axis=1)

Pool Quality - Fill Nulls

image Probably models won't use Pools Features, since they has few correlation to price (0.069798) and more than 99% of missing. But for the moment, we still filling null of PoolQC that has Area with based on the Overall Quality of the houses divided by 2.

In [53]:
PoolQC = {0: 'NA', 1: 'Po', 2: 'Fa', 3: 'TA', 4: 'Gd', 5: 'Ex'}

all_data.loc[(all_data.PoolArea>0) & (all_data.PoolQC.isnull()), ['PoolQC']] =\
        ((all_data.loc[(all_data.PoolArea>0) & (all_data.PoolQC.isnull()), ['OverallQual']]/2).round()).\
        apply(lambda x: x.map(PoolQC))

all_data.PoolQC = all_data.PoolQC.fillna('NA')

Functional - Miss Values Treatment

image Since Functional description include the statement "Assume typical unless deductions are warranted", we assume "Typ" for nulls cases.

In [54]:
all_data.Functional = all_data.Functional.fillna('Typ')

Fireplace Quality - Miss Values Treatment

Since all Fireplace Quality nulls has Fireplaces equal a zero, its sure that Fireplace Quality could be update to NA. image

In [55]:
all_data.loc[(all_data.Fireplaces==0) & (all_data.FireplaceQu.isnull()), ['FireplaceQu']] = 'NA'

Kitchen Quality - Miss Values Treatment

image Since all Kitchen Quality nulls has Kitchen Above Ground grater than zero, we assume mode for Kitchen Quality.

In [56]:
all_data.loc[(all_data.KitchenAbvGr>0) & (all_data.KitchenQual.isnull()), 
             ['KitchenQual']] = all_data.KitchenQual.mode()[0]

Alley, Fence and Miscellaneous Feature - Miss Values Treatment

  • Miscellaneous feature not covered in other categories.
  • Alley has a few records, and is not really common to have alley in properties.
  • It's not uncommon to see properties without fence at USA. image So, don't lose time and update nulls for NA.
In [57]:
all_data.Alley = all_data.Alley.fillna('NA')
all_data.Fence = all_data.Fence.fillna('NA')
all_data.MiscFeature = all_data.MiscFeature.fillna('NA')

Back to the Past! Garage Year Build from 2207

image

In [58]:
display(all_data.loc[all_data.GarageYrBlt==2207, ['GarageYrBlt', 'YearBuilt']])
all_data.loc[all_data.GarageYrBlt==2207.0, 'GarageYrBlt'] = 2007.0
GarageYrBlt YearBuilt
2581 2207.000 2006

Final Check and Filling Nulls

image

In [59]:
all_data = DataFrameImputer().fit_transform(all_data)

# Final check if we have some NA
print("Data nulls:", all_data.isnull().sum().sum())
Data nulls: 0

Mapping Ordinal Features

Any attribute or feature that is categorical in nature represents discrete values that belong to a specific finite set of categories or classes. The ordinal are special category type that can also be ordered based on rules on the context. So, ordinal categorical variables can be ordered and sorted on the basis of their values and hence these values have specific significance such that their order makes sense. image Like nominal features, even ordinal features might be present in text form and you need to map and transform them into their numeric representation.

Below, you can see that it is really easy to build your own transformation mapping scheme with the help of Python dictionaries and use the map function from pandas to transform the ordinal feature, and preserve its significance.

In [60]:
def map_ordinals(data):
    
    # LandSlope: Slope of property
    LandSlope = {}
    LandSlope['Gtl'] = 3 #'Gentle slope'
    LandSlope['Mod'] = 2 #'Moderate Slope'
    LandSlope['Sev'] = 1 #'Severe Slope'

    data.LandSlope = data.LandSlope.map(LandSlope)
        
    # ExterQual: Evaluates the quality of the material on the exterior 
    ExterQual = {}
    ExterQual['Ex'] = 5 #'Excellent'
    ExterQual['Gd'] = 4 #'Good'
    ExterQual['TA'] = 3 #'Average/Typical'
    ExterQual['Fa'] = 2 #'Fair'
    ExterQual['Po'] = 1 #'Poor'
    ExterQual['NA'] = 0 #'NA'

    data.ExterQual = data.ExterQual.map(ExterQual)

    # ExterCond: Evaluates the present condition of the material on the exterior
    data.ExterCond = data.ExterCond.map(ExterQual)

    #HeatingQC: Heating quality and condition
    data.HeatingQC = data.HeatingQC.map(ExterQual)

    # KitchenQual: Kitchen quality
    data.KitchenQual = data.KitchenQual.map(ExterQual)

    # FireplaceQu: Fireplace quality
    data.FireplaceQu = data.FireplaceQu.map(ExterQual)

    # GarageCond: Garage Conditionals
    data.GarageCond = data.GarageCond.map(ExterQual)

    PavedDrive = {}
    PavedDrive['Y'] = 3 #'Paved'
    PavedDrive['P'] = 2 #'Partial Pavement'
    PavedDrive['N'] = 1 #'Dirt/Gravel'

    data.PavedDrive = data.PavedDrive.map(PavedDrive)

    # LotShape: General shape of property
    LotShape = {}
    LotShape['Reg'] = 4 #'Regular'
    LotShape['IR1'] = 3 #'Slightly irregular'
    LotShape['IR2'] = 2 #'Moderately Irregular'
    LotShape['IR3'] = 1 #'Irregular'

    data.LotShape = data.LotShape.map(LotShape)

    # BsmtQual: Evaluates the height of the basement
    BsmtQual = {}
    BsmtQual['Ex'] = 5 #'Excellent (100+ inches)'
    BsmtQual['Gd'] = 4 #'Good (90-99 inches)'
    BsmtQual['TA'] = 3 #'Typical (80-89 inches)'
    BsmtQual['Fa'] = 2 #'Fair (70-79 inches)'
    BsmtQual['Po'] = 1 #'Poor (<70 inches'
    BsmtQual['NA'] = 0 #'No Basement'

    data.BsmtQual = data.BsmtQual.map(BsmtQual)

    # BsmtCond: Evaluates the general condition of the basement
    data.BsmtCond = data.BsmtCond.map(BsmtQual)

    # GarageQual: Garage quality
    data.GarageQual = data.GarageQual.map(BsmtQual)

    # PoolQC: Pool quality
    data.PoolQC = data.PoolQC.map(BsmtQual)
    
    # BsmtExposure: Refers to walkout or garden level walls
    BsmtExposure = {}
    BsmtExposure['Gd'] = 4 #'Good Exposure'
    BsmtExposure['Av'] = 3 #'Average Exposure (split levels or foyers typically score average or above)'
    BsmtExposure['Mn'] = 2 #'Mimimum Exposure'
    BsmtExposure['No'] = 1 #'No Exposure'
    BsmtExposure['NA'] = 0 #'No Basement'

    data.BsmtExposure = data.BsmtExposure.map(BsmtExposure)

    # BsmtFinType1: Rating of basement finished area
    BsmtFinType1 = {}
    BsmtFinType1['GLQ'] = 6 #'Good Living Quarters'
    BsmtFinType1['ALQ'] = 5 # 'Average Living Quarters'
    BsmtFinType1['BLQ'] = 4 # 'Below Average Living Quarters'
    BsmtFinType1['Rec'] = 3 # 'Average Rec Room'
    BsmtFinType1['LwQ'] = 2 # 'Low Quality'
    BsmtFinType1['Unf'] = 1 # 'Unfinshed'
    BsmtFinType1['NA'] = 0 #'No Basement'

    data.BsmtFinType1 = data.BsmtFinType1.map(BsmtFinType1)

    # BsmtFinType2: Rating of basement finished area (if multiple types)
    data.BsmtFinType2 = data.BsmtFinType2.map(BsmtFinType1)

    #CentralAir: Central air conditioning
    # Since with this transformatio as the same as binarize this feature
    CentralAir = {}
    CentralAir['N'] = 0
    CentralAir['Y'] = 1

    data.CentralAir = data.CentralAir.map(CentralAir)

    # GarageFinish: Interior finish of the garage
    GarageFinish = {}
    GarageFinish['Fin'] = 3 #'Finished'
    GarageFinish['RFn'] = 2 #'Rough Finished'
    GarageFinish['Unf'] = 1 #'Unfinished'
    GarageFinish['NA'] = 0 #'No Garage'
    
    data.GarageFinish = data.GarageFinish.map(GarageFinish)
    
    # Functional: Home functionality
    Functional = {}
    Functional['Typ'] = 7   # Typical Functionality
    Functional['Min1'] = 6  # Minor Deductions 1
    Functional['Min2'] = 5  # Minor Deductions 2
    Functional['Mod'] = 4   # Moderate Deductions
    Functional['Maj1'] = 3  # Major Deductions 1
    Functional['Maj2'] = 2  # Major Deductions 2
    Functional['Sev'] = 1   # Severely Damaged
    Functional['Sal'] = 0   # Salvage only

    data.Functional = data.Functional.map(Functional)
    
    #Street: Type of road access to property
    # Since with this transformatio as the same as binarize this feature
    Street = {}
    Street['Grvl'] = 0 # Gravel 
    Street['Pave'] = 1 # Paved

    data.Street = data.Street.map(Street)


    # Fence: Fence quality
    Fence = {}
    Fence['GdPrv'] = 5 #'Good Privacy'
    Fence['MnPrv'] = 4 #'Minimum Privacy'
    Fence['GdWo'] = 3 #'Good Wood'
    Fence['MnWw'] = 2 #'Minimum Wood/Wire'
    Fence['NA'] = 1 #'No Fence'

    data.Fence = data.Fence.map(Fence)
    #But No Fence has the higest median Sales Price. So I try to use it as categorical
            
    return data

all_data = map_ordinals(all_data)

Feature Engineering: Create New Features:

image

Include pool in the Miscellaneous features

In [61]:
display(all_data.loc[(all_data.PoolArea>0) & (all_data.MiscVal>0), ['MiscFeature', 'MiscVal', 'PoolArea', 'PoolQC']])
MiscFeature MiscVal PoolArea PoolQC
1376 TenC 2000 519 2

Check if we had others "TenC" in the dataset:

In [62]:
display(all_data.loc[(all_data.MiscFeature=='TenC'), ['MiscFeature', 'MiscVal', 'PoolArea', 'PoolQC']])
MiscFeature MiscVal PoolArea PoolQC
1376 TenC 2000 519 2

Since we don't have other "TenC" and others Pools don't coincide with any miscellaneous feature, we include the pools into the Misc Features and drop Pools columns after used it in the creation of others features.

In [63]:
all_data.loc[(all_data.PoolArea>0), ['MiscFeature']] = 'Pool'
all_data.loc[(all_data.PoolArea>0), ['MiscVal']] = all_data.loc[(all_data.PoolArea>0), 
                                                               ['MiscVal', 'PoolArea']].\
                                                                apply(lambda x: (x.MiscVal + x.PoolArea), axis=1)

Points Review

Since there are many punctuation characteristics in our base, and I believe that each person has a specific preference depending on the stage and moment of life, I believe that these variables have importance, but they present a lot of variation and bias.

So, it is important to consider a general score calculated from all the points that are agreed upon. But then you'll say, "What do you mean, you did not know that the base has general scores for condition and quality, you're dumb!". image "Patience young grasshopper", I know of them and I believe that there are some criteria for them, but they sum up a very narrow and discreet spectrum, and frankly who accepts criteria that express a point of view and are not simply natural laws without questioning them. So building the metric will give us a counterpoint based on the different grades, and it is not surprising that it is better for our model than all the grades, even the overall.

And if you still doubt, I put charts and correlation numbers to help you understand the benefits, and of course, then you can also question my own criteria and establish yours to calculate your overall score.

In [64]:
all_data['TotalExtraPoints'] = all_data.HeatingQC + all_data.PoolQC + all_data.FireplaceQu  + all_data.KitchenQual
all_data['TotalPoints'] =  (all_data.ExterQual + all_data.FireplaceQu + all_data.GarageQual + all_data.KitchenQual +
                            all_data.BsmtQual + all_data.BsmtExposure + all_data.BsmtFinType1 + all_data.PoolQC + 
                            all_data.ExterCond + all_data.BsmtCond + all_data.GarageCond + all_data.OverallCond +
                            all_data.BsmtFinType2 + all_data.HeatingQC ) + all_data.OverallQual**2
                         
df = all_data.loc[(all_data.SalePrice>0), ['TotalPoints', 'TotalExtraPoints', 'OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 
               'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'PoolQC', 'KitchenQual', 
               'FireplaceQu', 'GarageQual', 'GarageCond', 'SalePrice']]
In [65]:
fig = plt.figure(figsize=(20,10))
fig1 = fig.add_subplot(231); sns.regplot(x='BsmtExposure', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtExposure.corr(df['SalePrice'])))

fig2 = fig.add_subplot(232); sns.regplot(x='BsmtFinType1', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtFinType1.corr(df['SalePrice'])))

fig3 = fig.add_subplot(233); sns.regplot(x='BsmtFinType2', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtFinType2.corr(df['SalePrice'])))

fig4 = fig.add_subplot(234); sns.regplot(x='GarageQual', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.GarageQual.corr(df['SalePrice'])))

fig5 = fig.add_subplot(235); sns.regplot(x='GarageCond', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.GarageCond.corr(df['SalePrice'])))

fig6 = fig.add_subplot(236); sns.regplot(x='TotalPoints', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.TotalPoints.corr(df['SalePrice'])))
plt.show()
In [66]:
fig = plt.figure(figsize=(20,10))
fig1 = fig.add_subplot(231); sns.regplot(x='OverallQual', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.OverallQual.corr(df['SalePrice'])))

fig2 = fig.add_subplot(232); sns.regplot(x='OverallCond', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.OverallCond.corr(df['SalePrice'])))

fig3 = fig.add_subplot(233); sns.regplot(x='ExterQual', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.ExterQual.corr(df['SalePrice'])))

fig4 = fig.add_subplot(234); sns.regplot(x='ExterCond', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.ExterCond.corr(df['SalePrice'])))

fig5 = fig.add_subplot(235); sns.regplot(x='BsmtQual', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtQual.corr(df['SalePrice'])))

fig6 = fig.add_subplot(236); sns.regplot(x='BsmtCond', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtCond.corr(df['SalePrice'])))
plt.show()
In [67]:
fig = plt.figure(figsize=(20,10))
fig1 = fig.add_subplot(231); sns.regplot(x='HeatingQC', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.HeatingQC.corr(df['SalePrice'])))

fig2 = fig.add_subplot(232); sns.regplot(x='PoolQC', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.PoolQC.corr(df['SalePrice'])))

fig3 = fig.add_subplot(233); sns.regplot(x='KitchenQual', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.KitchenQual.corr(df['SalePrice'])))

fig4 = fig.add_subplot(234); sns.regplot(x='FireplaceQu', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.FireplaceQu.corr(df['SalePrice'])))

fig5 = fig.add_subplot(235); sns.regplot(x='TotalExtraPoints', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.TotalExtraPoints.corr(df['SalePrice'])))

plt.show()
In [68]:
all_data['GarageArea_x_Car'] = all_data.GarageArea * all_data.GarageCars

all_data['TotalBsmtSF_x_Bsm'] = all_data.TotalBsmtSF * all_data['1stFlrSF']

# We don´t have a feature with all construct area, maybe it is an interesting feature to create.
all_data['ConstructArea'] = (all_data.TotalBsmtSF + all_data.WoodDeckSF + all_data.GrLivArea +
                             all_data.OpenPorchSF + all_data.TSsnPorch + all_data.ScreenPorch + all_data.EnclosedPorch +
                             all_data.MasVnrArea + all_data.GarageArea + all_data.PoolArea )

#all_data['TotalArea'] = all_data.ConstructArea + all_data.LotArea

all_data['Garage_Newest'] = all_data.YearBuilt > all_data.GarageYrBlt
all_data.Garage_Newest =  all_data.Garage_Newest.apply(lambda x: 1 if x else 0)

all_data['TotalPorchSF'] = all_data.OpenPorchSF + all_data.EnclosedPorch + all_data.TSsnPorch + all_data.ScreenPorch + all_data.WoodDeckSF
all_data.EnclosedPorch = all_data.EnclosedPorch.apply(lambda x: 1 if x else 0)

all_data['LotAreaMultSlope'] = all_data.LotArea * all_data.LandSlope


all_data['BsmtSFPoints'] = (all_data.BsmtQual**2 + all_data.BsmtCond + all_data.BsmtExposure + 
                            all_data.BsmtFinType1 + all_data.BsmtFinType2)


all_data['BsmtSFMultPoints'] = all_data.TotalBsmtSF * (all_data.BsmtQual**2 + all_data.BsmtCond + all_data.BsmtExposure + 
                                                       all_data.BsmtFinType1 + all_data.BsmtFinType2)

all_data['TotBathrooms'] = all_data.FullBath + (all_data.HalfBath*0.5) + all_data.BsmtFullBath + (all_data.BsmtHalfBath*0.5)
all_data.FullBath = all_data.FullBath.apply(lambda x: 1 if x else 0)
all_data.HalfBath = all_data.HalfBath.apply(lambda x: 1 if x else 0)
all_data.BsmtFullBath = all_data.BsmtFullBath.apply(lambda x: 1 if x else 0)
all_data.BsmtHalfBath = all_data.BsmtHalfBath.apply(lambda x: 1 if x else 0)

One Hot Encode Categorical Features

You might now be wondering, if we have a data set wheres all categorical data already transformed and mapped them into numeric representations, why would we need more levels of encoding again? image To traditional people and liberals, I am not discussing a question of gender or option, but rather that If we directly fed these transformed numeric representations of categorical features into any algorithm, the model will essentially try to interpret these as raw numeric features and hence the notion of magnitude will be wrongly introduced in the system.

There are several schemes and strategies where dummy features are created for each unique value or label out of all the distinct categories in any feature, like one hot encoding, dummy coding, effect coding, and feature hashing schemes.

In one hot encoding strategy we considering have numeric representation of any categorical feature with m labels, the one hot encoding scheme, encodes or transforms the feature into m binary features, which can only contain a value of 1 or 0. Each observation in the categorical feature is thus converted into a vector of size m with only one of the values as 1 (indicating it as active).

From sklearn in preprocessing you can use the LabelEncoder to create a cod map for the category feature than use the OneHotEncoder to apply the one hot encode strategy above then. This is interesting for most real cases where your model will be applied to data that will be fed and updated continuously in a pipeline.

As our case is restricted to the data provided and it fits on a pandas data frame, we will make use of the function get_dummies from pandas, but attention this function does not transform the data to a vector as in the case of the previous or as in R.

The dummy coding scheme is similar to the one hot encoding scheme, except in the case of dummy coding scheme, when applied on a categorical feature with m distinct labels, we get m-1 binary features. Thus each value of the categorical variable gets converted into a vector of size m-1. The extra feature is completely disregarded and thus if the category values range from {0, 1, ..., m-1} the 0th or the m-1th feature is usually represented by a vector of all zeros (0).

In [69]:
def one_hot_encode(df):
    categorical_cols = df.select_dtypes(include=['object']).columns

    print(len(categorical_cols), "categorical columns")
    print(categorical_cols)
    # Remove special charactres and withe spaces. 
    for col in categorical_cols:
        df[col] = df[col].str.replace('\W', '').str.replace(' ', '_') #.str.lower()

    dummies = pd.get_dummies(df[categorical_cols], columns = categorical_cols).columns
    df = pd.get_dummies(df, columns = categorical_cols)

    print("Total Columns:",len(df.columns))
    print(df.info())
    
    return df, dummies

# Correct Categorical from int to str types
all_data.MSSubClass = all_data.MSSubClass.astype('str')
all_data.MoSold = all_data.MoSold.astype('str')

all_data, dummies = one_hot_encode(all_data)
23 categorical columns
Index(['MSSubClass', 'MSZoning', 'Alley', 'LandContour', 'LotConfig',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'Foundation', 'Heating', 'Electrical', 'GarageType', 'MiscFeature',
       'MoSold', 'SaleType', 'SaleCondition'],
      dtype='object')
Total Columns: 259
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2908 entries, 0 to 2907
Columns: 259 entries, LotFrontage to SaleCondition_Partial
dtypes: float64(15), int64(52), uint8(192)
memory usage: 2.0 MB
None

Removing Dummies with none observations in train or test datasets

This is such a simple action, we often find it to be obvious, but note that few books or articles make them as standard. In fact, it does not make sense to use categorical for you to train your model if there is no record in one of the training set or test, do not you agree? image

In [70]:
# Find Dummies with all test observatiosn are equal to 0
ZeroTest = all_data[dummies][ntrain:].sum()==0
all_data.drop(dummies[ZeroTest], axis=1, inplace=True)
print('Dummins in test dataset with all observatios equal to 0:',len(dummies[ZeroTest]),'of \n',dummies[ZeroTest],'\n')
dummies = dummies.drop(dummies[ZeroTest])

# Find dummies with all training observatiosn are equal to 0
ZeroTest = all_data[dummies][:ntrain].sum()==0
all_data.drop(dummies[ZeroTest], axis=1, inplace=True)
print('Dummins in trainig dataset with all observatios equal to 0:',len(dummies[ZeroTest]),'of \n',dummies[ZeroTest],'\n')
dummies = dummies.drop(dummies[ZeroTest])

del ZeroTest
Dummins in test dataset with all observatios equal to 0: 13 of 
 Index(['Condition2_RRAe', 'Condition2_RRAn', 'Condition2_RRNn',
       'HouseStyle_25Fin', 'RoofMatl_Membran', 'RoofMatl_Metal',
       'RoofMatl_Roll', 'Exterior1st_ImStucc', 'Exterior1st_Stone',
       'Exterior2nd_Other', 'Heating_Floor', 'Heating_OthW', 'Electrical_Mix'],
      dtype='object') 

Dummins in trainig dataset with all observatios equal to 0: 1 of 
 Index(['MSSubClass_150'], dtype='object') 

Transform Years to Ages and Create Flags to New and Remod

Instead of falling into the discussion of whether years are ordinal or not, why not work with age? image

In [71]:
def AgeYears(feature): 
    return feature.apply(lambda x: 0 if x==0 else (2011 - x))

all_data.YearBuilt = AgeYears(all_data.YearBuilt)
all_data.YearRemodAdd = AgeYears(all_data.YearRemodAdd)
all_data.GarageYrBlt = AgeYears(all_data.GarageYrBlt) 
all_data.YrSold =  AgeYears(all_data.YrSold) 

Altogether, there are 3 variables that are relevant with regards to the Age of a house; YearBlt, YearRemodAdd, and YearSold. YearRemodAdd defaults to YearBuilt if there has been no Remodeling/Addition. I will use YearRemodeled and YearSold to determine the Age. However, as parts of old constructions will always remain and only parts of the house might have been renovated, I will also introduce a Remodeled Yes/No variable. This should be seen as some sort of penalty parameter that indicates that if the Age is based on a remodeling date, it is probably worth less than houses that were built from scratch in that same year.

In [72]:
all_data['Remod'] = 2
all_data.loc[(all_data.YearBuilt==all_data.YearRemodAdd), ['Remod']] = 0
all_data.loc[(all_data.YearBuilt!=all_data.YearRemodAdd), ['Remod']] = 1

all_data.Age = all_data.YearRemodAdd - all_data.YrSold # sice I convert both to age

all_data["IsNew"] = 2
all_data.loc[(all_data.YearBuilt==all_data.YrSold), ['IsNew']] = 1
all_data.loc[(all_data.YearBuilt!=all_data.YrSold), ['IsNew']] = 0

Check for any correlations between features

image To quantify the linear relationship between the features, I will now create a correlation matrix.

The correlation matrix is identical to a covariance matrix computed from standardized data. The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficients (often abbreviated as Pearson's r), which measure the linear dependence between pairs of features: image Pearson's correlation coefficient can simply be calculated as the covariance between two features x and y (numerator) divided by the product of their standard deviations (denominator): image The covariance between standardized features is in fact equal to their linear correlation coefficient. Use NumPy's corrcoef and seaborn's heatmap functions to plot the correlation matrix array as a heat map.

To fit a linear regression model, we are interested in those features that have a high correlation with our target variable. So, I will make a zoom in these features in order of their correlation with SalePrice.

In [73]:
corr = all_data[all_data.SalePrice>1].corr()
top_corr_cols = corr[abs((corr.SalePrice)>=.26)].SalePrice.sort_values(ascending=False).keys()
top_corr = corr.loc[top_corr_cols, top_corr_cols]
dropSelf = np.zeros_like(top_corr)
dropSelf[np.triu_indices_from(dropSelf)] = True
plt.figure(figsize=(20, 20))
sns.heatmap(top_corr, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=".2f", mask=dropSelf)
sns.set(font_scale=0.5)
plt.show()
del corr, dropSelf, top_corr

We can see that our target variable SalePrice shows the largest correlation with the OverallQual variable (0.79), followed by. GrLivArea (0.71). This seems to make sense, since in fact we expect the overall quality and size of the living area to have a greater influence on our value judgments about a property.

From the graph above, it also becomes clear the multicollinearity is an issue.

  • The correlation between GarageCars and GarageArea is very high (0.88), and has very close correlation with the SalePrice.
  • From total square feet of basement area (TotalBsmtSF) and first Floor square feet (1stFlrSF), we found 0.81 of correlation and same correlation with sale price (0.61).
  • Original construction date (YearBuilt) has a little more correlation with price (0.52) than GarageYrBlt (0.49), and a high correlation between them (0.83)
  • 0.83 is the correlation between total rooms above grade not include bathrooms (TotRmsAbvGrd) and GrLivArea, but TotRmsAbvGrd has only 0.51 of correlation with sale price.

Let's see their distributions and type of relation curve between the 10th features with largest correlation with sales price,

Drop the features with highest correlations to other Features:

Colinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. And as you see above, it is easy to find highest colinearities. image You should always be concerned about the collinearity, regardless of the model/method being linear or not, or the main task being prediction or classification.

Assume a number of linearly correlated covariates/features present in the data set and Random Forest as the method. Obviously, random selection per node may pick only (or mostly) collinear features which may/will result in a poor split, and this can happen repeatedly, thus negatively affecting the performance.

Now, the collinear features may be less informative of the outcome than the other (non-collinear) features and as such they should be considered for elimination from the feature set anyway. However, assume that the features are ranked high in the 'feature importance' list produced by RF. As such they would be kept in the data set unnecessarily increasing the dimensionality. So, in practice, I'd always, as an exploratory step (out of many related) check the pairwise association of the features, including linear correlation.

In [74]:
all_data.drop(['FireplaceQu', 'BsmtSFPoints', 'TotalBsmtSF', 'GarageArea', 'GarageCars', 'OverallQual', 'GrLivArea', 
               'TotalBsmtSF_x_Bsm', '1stFlrSF', 'PoolArea', 'LotArea', 'SaleCondition_Partial', 'Exterior1st_VinylSd',
               'GarageCond', 'HouseStyle_2Story', 'BsmtSFMultPoints', 'ScreenPorch', 'LowQualFinSF', 'BsmtFinSF2',
               'TSsnPorch'], axis=1, inplace=True) 
In [75]:
corr = all_data[all_data.SalePrice>1].corr()
top_corr_cols = corr[abs((corr.SalePrice)>=.26)].SalePrice.sort_values(ascending=False).keys()
top_corr = corr.loc[top_corr_cols, top_corr_cols]
dropSelf = np.zeros_like(top_corr)
dropSelf[np.triu_indices_from(dropSelf)] = True
plt.figure(figsize=(20, 20))
sns.heatmap(top_corr, cmap=sns.diverging_palette(220, 10, as_cmap=True), annot=True, fmt=".2f", mask=dropSelf)
sns.set(font_scale=0.5)
plt.show()
del corr, dropSelf, top_corr
In [76]:
sns.set(font_scale=1.0)
g = sns.pairplot(all_data.loc[all_data.SalePrice>0, top_corr_cols])

From the pair plot above we note some points the need attention, like:

  • We can confirm the treatment of some outliers, most of then is area features.
    • So, we confirm the reduce with a little cut of outliers.
    • On the other hand, if we continue to see some of others outliers we need attention to:
      • Possible use of a robust scaler.
      • Search, evaluate and cut outliers on base of residuals plot.
      • As an alternative to throwing out outliers, we will look at a robust method of regression using the RANdom SAmple Consensus (RANSAC) algorithm, which fits a regression model to a subset of the data.
  • We can see that most of data appears skewed and some of than has peaks long tails. It is suggest a use of box cox transformation
  • The Sale Price appears skewed and has a long right tail.

Identify and treat multicollinearity:

Multicollinearity is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model, leading to unreliable and unstable estimates of regression coefficients. To make matters worst multicollinearity can emerge even when isolated pairs of variables are not collinear.

To identify, we need start with the coefficient of determination, r2, is the square of the Pearson correlation coefficient r. The coefficient of determination, with respect to correlation, is the proportion of the variance that is shared by both variables. It gives a measure of the amount of variation that can be explained by the model (the correlation is the model). It is sometimes expressed as a percentage (e.g., 36% instead of 0.36) when we discuss the proportion of variance explained by the correlation. However, you should not write r2 = 36%, or any other percentage. You should write it as a proportion (e.g., r2 = 0.36). image

Already the Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It is may be calculated for each predictor by doing a linear regression of that predictor on all the other predictors, and then obtaining the R2 from that regression. It is calculated by taking the the ratio of the variance of all a given model's betas divide by the variance of a single beta if it were fit alone [1/(1-R2)]. Thus, a VIF of 1.8 tells us that the variance (the square of the standard error) of a particular coefficient is 80% larger than it would be if that predictor was completely uncorrelated with all the other predictors. The VIF has a lower bound of 1 but no upper bound. Authorities differ on how high the VIF has to be to constitute a problem (e.g.: 2.50 (R2 equal to 0.6), sometimes 5 (R2 equal to .8), or greater than 10 (R2 equal to 0.9) and so on).

But there are several situations in which multicollinearity can be safely ignored:

  • Interaction terms and higher-order terms (e.g., squared and cubed predictors) are correlated with main effect terms because they include the main effects terms. Ops! Sometimes we use polynomials to solve problems, indeed! But keep calm, in these cases, standardizing the predictors can removed the multicollinearity.
  • Indicator, like dummy or one-hot-encode, that represent a categorical variable with three or more categories. If the proportion of cases in the reference category is small, the indicator will necessarily have high VIF's, even if the categorical is not associated with other variables in the regression model. But, you need check if some dummy is collinear or has multicollinearity with other features outside of their dummies.
  • *Control feature if the feature of interest do not have high VIF's. Here's the thing about multicollinearity: it's only a problem for the features that are collinear. It increases the standard errors of their coefficients, and it may make those coefficients unstable in several ways. But so long as the collinear feature are only used as control feature, and they are not collinear with your feature of interest, there's no problem. The coefficients of the features of interest are not affected, and the performance of the control feature as controls is not impaired.

So, generally, we could run the same model twice, once with severe multicollinearity and once with moderate multicollinearity. This provides a great head-to-head comparison and it reveals the classic effects of multicollinearity. However, when standardizing your predictors doesn't work, you can try other solutions such as:

  • removing highly correlated predictors
  • linearly combining predictors, such as adding them together
  • running entirely different analyses, such as partial least squares regression or principal components analysis

When considering a solution, keep in mind that all remedies have potential drawbacks. If you can live with less precise coefficient estimates, or a model that has a high R-squared but few significant predictors, doing nothing can be the correct decision because it won't impact the fit.

Given the potential for correlation among the predictors, we'll have display the variance inflation factors (VIF), which indicate the extent to which multicollinearity is present in a regression analysis. Hence such variables need to be removed from the model. Deleting one variable at a time and then again checking the VIF for the model is the best way to do this.

So, I start the analysis already having removed the features with he highest collinearities and run VIF.

In [77]:
all_data.rename(columns={'2ndFlrSF':'SndFlrSF'}, inplace=True)

def VRF(predict, data, y):
   
    scale = StandardScaler(with_std=False)
    df = pd.DataFrame(scale.fit_transform(data), columns= cols)
    features = "+".join(cols)
    df['SalePrice'] = y.values

    # get y and X dataframes based on this regression:
    y, X = dmatrices(predict + ' ~' + features, data = df, return_type='dataframe')

   # Calculate VIF Factors
    # For each X, calculate VIF and save in dataframe
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns

    # Inspect VIF Factors
    display(vif.sort_values('VIF Factor'))
    return vif

# Remove the higest correlations and run a multiple regression
cols = all_data.columns
cols = cols.drop(['SalePrice'])
vif = VRF('SalePrice', all_data.loc[all_data.SalePrice>0, cols], all_data.SalePrice[all_data.SalePrice>0])
VIF Factor features
0 1.000 Intercept
41 1.267 YrSold
39 1.462 Fence
22 1.529 BsmtHalfBath
10 1.530 ExterCond
29 1.553 Functional
120 1.572 Condition2_PosN
2 1.582 Street
23 1.644 FullBath
3 1.733 LotShape
119 1.755 Condition2_PosA
34 1.828 PavedDrive
143 1.924 Exterior1st_AsphShn
13 2.260 BsmtExposure
37 2.263 EnclosedPorch
225 2.289 Remod
226 2.344 IsNew
1 2.351 LotFrontage
4 2.405 LandSlope
144 2.570 Exterior1st_BrkComm
5 2.580 OverallCond
16 2.589 BsmtFinType2
19 2.616 CentralAir
48 2.619 LotAreaMultSlope
116 2.705 Condition2_Artery
36 2.718 OpenPorchSF
129 2.765 HouseStyle_25Unf
24 2.799 HalfBath
221 3.094 SaleCondition_AdjLand
32 3.178 GarageFinish
... ... ...
123 inf BldgType_Duplex
124 inf BldgType_Twnhs
125 inf BldgType_TwnhsE
132 inf RoofStyle_Flat
133 inf RoofStyle_Gable
134 inf RoofStyle_Gambrel
135 inf RoofStyle_Hip
136 inf RoofStyle_Mansard
108 inf Condition1_Feedr
107 inf Condition1_Artery
106 inf Neighborhood_Veenker
105 inf Neighborhood_Timber
89 inf Neighborhood_Edwards
90 inf Neighborhood_Gilbert
91 inf Neighborhood_IDOTRR
92 inf Neighborhood_MeadowV
93 inf Neighborhood_Mitchel
94 inf Neighborhood_NAmes
95 inf Neighborhood_NPkVill
87 inf Neighborhood_CollgCr
96 inf Neighborhood_NWAmes
98 inf Neighborhood_NridgHt
99 inf Neighborhood_OldTown
100 inf Neighborhood_SWISU
101 inf Neighborhood_Sawyer
102 inf Neighborhood_SawyerW
103 inf Neighborhood_Somerst
104 inf Neighborhood_StoneBr
97 inf Neighborhood_NoRidge
113 inf Condition1_RRAn

227 rows × 2 columns

From the results, I conclude that we need work in the features where's the VIF stated as inf. This is done to prevent multicollinearity, or the dummy variable trap caused by including a dummy variable for every single category. let's try remove one or two dummies for every category and check if it solves the other dummies from its category:

In [78]:
# Remove one feature with VIF on Inf from the same category and run a multiple regression
cols = cols.drop(['Condition1_PosN', 'Neighborhood_NWAmes', 'Exterior1st_CBlock', 'BldgType_1Fam', 'RoofStyle_Flat',
                  'MSZoning_Call', 'Alley_Grvl', 'LandContour_Bnk', 'LotConfig_Corner', 'GarageType_2Types', 'MSSubClass_45',
                  'MasVnrType_BrkCmn', 'Foundation_CBlock', 'MiscFeature_Gar2', 'SaleType_COD', 'Exterior2nd_CBlock'])

vif = VRF('SalePrice', all_data.loc[all_data.SalePrice>0, cols], all_data.SalePrice[all_data.SalePrice>0])
VIF Factor features
0 1.000 Intercept
165 1.191 Foundation_Wood
197 1.196 SaleType_Con
202 1.216 SaleType_Oth
75 1.222 LotConfig_FR3
107 1.235 Condition1_RRNe
196 1.256 SaleType_CWD
41 1.265 YrSold
78 1.273 Neighborhood_Blueste
199 1.277 SaleType_ConLI
200 1.359 SaleType_ConLw
74 1.411 LotConfig_FR2
164 1.413 Foundation_Stone
100 1.454 Neighborhood_Veenker
39 1.461 Fence
22 1.529 BsmtHalfBath
10 1.530 ExterCond
29 1.553 Functional
198 1.572 SaleType_ConLD
113 1.572 Condition2_PosN
2 1.582 Street
23 1.644 FullBath
104 1.681 Condition1_PosA
108 1.707 Condition1_RRNn
3 1.733 LotShape
112 1.755 Condition2_PosA
76 1.782 LotConfig_Inside
34 1.827 PavedDrive
73 1.873 LotConfig_CulDSac
134 1.924 Exterior1st_AsphShn
... ... ...
51 88.633 MSSubClass_160
174 96.535 GarageType_Attchd
50 108.628 MSSubClass_120
170 109.330 Electrical_FuseA
156 112.357 Exterior2nd_WdSdng
126 113.294 RoofStyle_Hip
149 118.663 Exterior2nd_HdBoard
124 121.892 RoofStyle_Gable
57 132.145 MSSubClass_50
151 138.073 Exterior2nd_MetalSd
173 143.023 Electrical_SBrkr
155 185.699 Exterior2nd_VinylSd
58 236.718 MSSubClass_60
54 320.428 MSSubClass_20
183 324.344 MiscFeature_Shed
180 424.385 MiscFeature_NA
188 inf MoSold_2
115 inf BldgType_Duplex
63 inf MSSubClass_90
184 inf MoSold_1
185 inf MoSold_10
186 inf MoSold_11
187 inf MoSold_12
195 inf MoSold_9
194 inf MoSold_8
193 inf MoSold_7
192 inf MoSold_6
191 inf MoSold_5
189 inf MoSold_3
190 inf MoSold_4

211 rows × 2 columns

Excellent, we are in the good path, but ... image ... we need continuing work on the remaining features to reduce the VIF, lets to continue and try to get less then 50.

In [79]:
# Remove one feature with highest VIF from the same category and run a multiple regression
cols = cols.drop(['PoolQC', 'BldgType_TwnhsE', 'BsmtFinSF1', 'BsmtUnfSF', 'Electrical_SBrkr', 'Exterior1st_MetalSd',
                  'Exterior2nd_VinylSd', 'GarageQual', 'GarageType_Attchd', 'HouseStyle_1Story', 'MasVnrType_None',
                  'MiscFeature_NA', 'MSZoning_RL', 'RoofStyle_Gable', 'SaleCondition_Normal', 'MoSold_10',
                  'SaleType_New', 'SndFlrSF', 'TotalPorchSF', 'WoodDeckSF', 'BldgType_Duplex', 'MSSubClass_90'])
              
vif = VRF('SalePrice', all_data.loc[all_data.SalePrice>0, cols], all_data.SalePrice[all_data.SalePrice>0])
VIF Factor features
0 1.000 Intercept
186 1.134 SaleCondition_Family
177 1.151 SaleType_Con
181 1.162 SaleType_Oth
179 1.167 SaleType_ConLI
149 1.169 Foundation_Wood
176 1.175 SaleType_CWD
66 1.213 LotConfig_FR3
35 1.229 YrSold
98 1.229 Condition1_RRNe
180 1.237 SaleType_ConLw
69 1.255 Neighborhood_Blueste
57 1.270 MSZoning_RH
164 1.299 MiscFeature_Shed
183 1.303 SaleCondition_Abnorml
178 1.333 SaleType_ConLD
112 1.354 RoofStyle_Gambrel
148 1.354 Foundation_Stone
15 1.381 BsmtFinType2
65 1.386 LotConfig_FR2
34 1.389 MiscVal
163 1.391 MiscFeature_Pool
33 1.421 Fence
159 1.425 GarageType_CarPort
154 1.428 Electrical_FuseA
113 1.434 RoofStyle_Hip
91 1.440 Neighborhood_Veenker
19 1.448 BsmtHalfBath
26 1.454 Functional
2 1.463 Street
... ... ...
126 7.810 Exterior1st_Plywood
52 8.180 MSSubClass_70
151 8.272 Heating_GasW
37 8.449 TotalPoints
48 8.898 MSSubClass_30
138 9.332 Exterior2nd_Plywood
44 9.874 MSSubClass_160
43 10.254 MSSubClass_120
116 10.381 RoofMatl_CompShg
39 11.047 ConstructArea
128 11.499 Exterior1st_WdSdng
36 11.736 TotalExtraPoints
141 11.977 Exterior2nd_WdSdng
40 11.992 Garage_Newest
94 12.091 Condition1_Norm
150 12.884 Heating_GasA
135 13.096 Exterior2nd_HdBoard
125 13.314 Exterior1st_HdBoard
54 13.814 MSSubClass_80
111 14.071 HouseStyle_SLvl
107 14.163 HouseStyle_15Fin
161 15.071 GarageType_NA
6 17.009 YearBuilt
51 25.064 MSSubClass_60
124 25.281 Exterior1st_CemntBd
134 25.316 Exterior2nd_CmentBd
50 25.578 MSSubClass_50
47 31.834 MSSubClass_20
105 33.733 BldgType_2fmCon
46 35.043 MSSubClass_190

189 rows × 2 columns

Defining Categorical and Boolean Data as unit8 types

Remember we used the panda for one hot encode and some categorical ones had already been provided or created as numbers. So, in order that the models do not make inappropriate use of features transformed into numbers and apply only calculations relevant to categorical, we have to transform their type into category type or in unit8 to leave some calculations. image

In [80]:
# Reserve a copy for futer analysis:
df_copy = all_data[all_data.SalePrice>0].copy()

all_data.CentralAir = all_data.CentralAir.astype('uint8')
all_data.Garage_Newest = all_data.Garage_Newest.astype('uint8')
all_data.EnclosedPorch = all_data.EnclosedPorch.astype('uint8')
all_data.FullBath = all_data.FullBath.astype('uint8')
all_data.HalfBath = all_data.HalfBath.astype('uint8')
all_data.BsmtFullBath = all_data.BsmtFullBath.astype('uint8')
all_data.BsmtHalfBath = all_data.BsmtHalfBath.astype('uint8')
all_data.Remod = all_data.Remod.astype('uint8')
all_data.IsNew = all_data.IsNew.astype('uint8') 
all_data.Street = all_data.Street.astype('uint8') # orinal
all_data.PavedDrive = all_data.PavedDrive.astype('uint8') # ordinal
all_data.Functional = all_data.Functional.astype('uint8') # ordinal
all_data.LandSlope = all_data.LandSlope.astype('uint8') # ordinal

'''
for feat in cols:
    if all_data[feat].dtype=='uint8':
        all_data[feat] = all_data[feat].astype('category')
'''
Out[80]:
"\nfor feat in cols:\n    if all_data[feat].dtype=='uint8':\n        all_data[feat] = all_data[feat].astype('category')\n"

Box cox transformation of highly skewed features

A Box Cox transformation is a way to transform non-normal data distribution into a normal shape. image Why does this matter?

  • Model bias and spurious interactions: If you are performing a regression or any statistical modeling, this asymmetrical behavior may lead to a bias in the model. If a factor has a significant effect on the average, because the variability is much larger, many factors will seem to have a stronger effect when the mean is larger. This is not due, however, to a true factor effect but rather to an increased amount of variability that affects all factor effect estimates when the mean gets larger. This will probably generate spurious interactions due to a non-constant variation, resulting in a very complex model with many spurious and unrealistic interactions.
  • Normality is an important assumption for many statistical techniques: such as individuals control charts, Cp/Cpk analysis, t-tests and analysis of variance (ANOVA). A substantial departure from normality will bias your capability estimates.

One solution to this is to transform your data into normality using a Box-Cox transformation means that you are able to run a broader number of tests.

At the core of the Box Cox transformation is an exponent, lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; The 'optimal value' is the one which results in the best approximation of a normal distribution curve. The transformation of Y has the form:

The scipy implementation proceeded with this formula, then you need before take care of negatives values if you have. A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. For the latter choice, you can show that a = b – min(Y), where b is either a small number or is 1. This test only works for positive data. However, Box and Cox did propose a second formula that can be used for negative y-values, not implemented in scipy: The formula are deceptively simple. Testing all possible values by hand is unnecessarily labor intensive.

Common Box-Cox Transformations

Lambda value (λ) Transformed data (Y')
-3 Y**-3 = 1/Y**3
-2 Y**-2 = 1/Y**2
-1 Y**-1 = 1/Y
-0.5 Y**-0.5 = 1/(√(Y))
0 log(Y)(*)
0.5 Y0.5 = √(Y)
1 Y**1 = Y
2 Y**2
3 Y**3

(*)Note: the transformation for zero is log(0), otherwise all data would transform to Y**0 = 1. The transformation doesn't always work well, so make sure you check your data after the transformation with a normal probability plot or if the skew are reduced, tending to zero.

In [81]:
numeric_features = list(all_data.loc[:, cols].dtypes[(all_data.dtypes != "category") & (all_data.dtypes !='uint8')].index)

'''
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
'''
skewed_features = all_data[numeric_features].apply(lambda x : skew (x.dropna())).sort_values(ascending=False)

#compute skewness
skewness = pd.DataFrame({'Skew' :skewed_features})   

# Get only higest skewed features
skewness = skewness[abs(skewness) > 0.7]
skewness = skewness.dropna()
print ("There are {} higest skewed numerical features to box cox transform".format(skewness.shape[0]))

l_opt = {}

for feat in skewness.index:
    all_data[feat], l_opt[feat] = boxcox((all_data[feat]+1))

skewed_features2 = all_data[skewness.index].apply(lambda x : skew (x.dropna())).sort_values(ascending=False)

#compute skewness
skewness2 = pd.DataFrame({'New Skew' :skewed_features2})   
display(pd.concat([skewness, skewness2], axis=1).sort_values(by=['Skew'], ascending=False))
There are 18 higest skewed numerical features to box cox transform
Skew New Skew
MiscVal 21.750 4.772
KitchenAbvGr 4.349 -2.059
LotAreaMultSlope 3.494 0.136
BsmtFinType2 3.147 -0.434
MasVnrArea 2.622 0.457
OpenPorchSF 2.529 -0.025
Fence 1.751 1.533
GarageArea_x_Car 1.354 -0.132
ExterCond 1.313 -0.036
LotFrontage 1.231 0.171
BsmtExposure 1.120 -0.030
ConstructArea 1.088 0.022
ExterQual 0.788 -0.023
Fireplaces 0.725 0.112
TotRmsAbvGrd 0.717 -0.000
LotShape -1.247 -0.597
BsmtQual -1.265 0.046
BsmtCond -3.633 0.657

As you can see, we were able at first to bring most the numerical values closer to normal. Maybe you're not satisfied with the results of MiscVal and Kitchener and want to understand if we really need to continue to transform some discrete data. So, let's take a look at the QQ test of these features.

In [82]:
def QQ_plot(data, measure):
    fig = plt.figure(figsize=(12,4))

    #Get the fitted parameters used by the function
    (mu, sigma) = norm.fit(data)

    #Kernel Density plot
    fig1 = fig.add_subplot(121)
    sns.distplot(data, fit=norm)
    fig1.set_title(measure + ' Distribution ( mu = {:.2f} and sigma = {:.2f} )'.format(mu, sigma), loc='center')
    fig1.set_xlabel(measure)
    fig1.set_ylabel('Frequency')

    #QQ plot
    fig2 = fig.add_subplot(122)
    res = probplot(data, plot=fig2)
    fig2.set_title(measure + ' Probability Plot (skewness: {:.6f} and kurtosis: {:.6f} )'.\
                   format(data.skew(), data.kurt()), loc='center')

    plt.tight_layout()
    plt.show()
    
for feat in skewness.index:
    QQ_plot(all_data[feat], ('Boxcox1p of {}'.format(feat)))

As you have seen, really MiscVal and Kitchener really do not seem to be good results, especially MiscVal, but it is a fact that both variables do not look good indifferent to their distribution.

As for the other discrete variables, in addition to having presented significant improvements, they also pass the QQ test and present interesting distributions as we can observe in their respective graphs.

So, we can continue to apply the BoxCox on this features and leave to feature selection algorithms to decide if we continue with some of then or not.

Evaluate Apply Polynomials by Region Plots on the more Correlated Features

In [83]:
fig = plt.figure(figsize=(20,15)) 
ax = fig.add_subplot(331); g = sns.regplot(y='SalePrice', x='ConstructArea', data = all_data[all_data.SalePrice>0], order=3) 
ax = fig.add_subplot(332); g = sns.regplot(y='SalePrice', x='GarageArea_x_Car', data = all_data[all_data.SalePrice>0], order=3)
ax = fig.add_subplot(333); g = sns.regplot(y='SalePrice', x='LotAreaMultSlope', data = all_data[all_data.SalePrice>0], order=3) 
ax = fig.add_subplot(334); g = sns.regplot(y='SalePrice', x='TotalExtraPoints', data = all_data[all_data.SalePrice>0], order=4) 
ax = fig.add_subplot(335); g = sns.regplot(y='SalePrice', x='TotalPoints', data = all_data[all_data.SalePrice>0], order=3) 

plt.tight_layout() 

Evaluating Polynomials Options Performance

One way to account for the violation of linearity assumption is to use a polynomial regression model by adding polynomial terms. image Although we can use polynomial regression to model a nonlinear relationship, it is still considered a multiple linear regression model because of the linear regression coefficients w.

Moreover, as we have seen, some of our features are better when interacting with each other than with just observed ones, but some have a negative effect.

So, let's check it more carefully.

In [84]:
def poly(X, y, feat=''):

    # Initializatin of regression models
    regr = LinearRegression()
    regr = regr.fit(X, y)
    y_lin_fit = regr.predict(X)
    linear_r2 = r2_score(y, regr.predict(X))

    # create polynomial features
    quadratic = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
    cubic = PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)
    fourth = PolynomialFeatures(degree=4, interaction_only=False, include_bias=False)
    fifth = PolynomialFeatures(degree=5, interaction_only=False, include_bias=False)
    X_quad = quadratic.fit_transform(X)
    X_cubic = cubic.fit_transform(X)
    X_fourth = fourth.fit_transform(X)
    X_fifth = fifth.fit_transform(X)

    # quadratic fit
    regr = regr.fit(X_quad, y)
    y_quad_fit = regr.predict(quadratic.fit_transform(X))
    quadratic_r2 = r2_score(y, y_quad_fit)

    # cubic fit
    regr = regr.fit(X_cubic, y)
    y_cubic_fit = regr.predict(cubic.fit_transform(X))
    cubic_r2 = r2_score(y, y_cubic_fit)

    # Fourth fit
    regr = regr.fit(X_fourth, y)
    y_fourth_fit = regr.predict(fourth.fit_transform(X))
    four_r2 = r2_score(y, y_fourth_fit)

    # Fifth fit
    regr = regr.fit(X_fifth, y)
    y_fifth_fit = regr.predict(fifth.fit_transform(X))
    five_r2 = r2_score(y, y_fifth_fit)
    
    if len(feat)>0:
        fig = plt.figure(figsize=(20,5))
        # Plot lowest Polynomials
        fig1 = fig.add_subplot(121)
        plt.scatter(X[feat], y, label='training points', color='lightgray')
        plt.plot(X[feat], y_lin_fit, label='linear (d=1), $R^2=%.3f$' % linear_r2, color='blue', lw=0.5, linestyle=':')
        plt.plot(X[feat], y_quad_fit, label='quadratic (d=2), $R^2=%.3f$' % quadratic_r2, color='red', lw=0.5, linestyle='-')
        plt.plot(X[feat], y_cubic_fit, label='cubic (d=3), $R^2=%.3f$' % cubic_r2,  color='green', lw=0.5, linestyle='--')

        plt.xlabel(feat)
        plt.ylabel('Sale Price')
        plt.legend(loc='upper left')

        # Plot higest Polynomials
        fig2 = fig.add_subplot(122)
        plt.scatter(X[feat], y, label='training points', color='lightgray')
        plt.plot(X[feat], y_lin_fit, label='linear (d=1), $R^2=%.3f$' % linear_r2, color='blue', lw=2, linestyle=':')
        plt.plot(X[feat], y_fifth_fit, label='Fifth (d=5), $R^2=%.3f$' % five_r2, color='yellow', lw=2, linestyle='-')
        plt.plot(X[feat], y_fifth_fit, label='Fourth (d=4), $R^2=%.3f$' % four_r2, color='red', lw=2, linestyle=':')

        plt.xlabel(feat)
        plt.ylabel('Sale Price')
        plt.legend(loc='upper left')
    else:
        # Plot initialisation
        fig = plt.figure(figsize=(20,10))
        ax = fig.add_subplot(121, projection='3d')
        ax.scatter(X.iloc[:, 0], X.iloc[:, 1], y, s=40)

        # make lines of the regressors:
        plt.plot(X.iloc[:, 0], X.iloc[:, 1], y_lin_fit, label='linear (d=1), $R^2=%.3f$' % linear_r2, 
                 color='blue', lw=2, linestyle=':')
        plt.plot(X.iloc[:, 0], X.iloc[:, 1], y_quad_fit, label='quadratic (d=2), $R^2=%.3f$' % quadratic_r2, 
                 color='red', lw=0.5, linestyle='-')
        plt.plot(X.iloc[:, 0], X.iloc[:, 1], y_cubic_fit, label='cubic (d=3), $R^2=%.3f$' % cubic_r2, 
                 color='green', lw=0.5, linestyle='--')
        # label the axes
        ax.set_xlabel(X.columns[0])
        ax.set_ylabel(X.columns[1])
        ax.set_zlabel('Sales Price')
        ax.set_title("Poly up to 3 degree")
        plt.legend(loc='upper left')

    plt.tight_layout()
    plt.show()
In [85]:
y = all_data.SalePrice[all_data.SalePrice>0]
X = all_data.loc[all_data.SalePrice>0, ['ConstructArea']] 
poly(X, y, 'ConstructArea')

X = all_data.loc[all_data.SalePrice>0, ['ConstructArea', 'TotalPoints']] 
poly(X, y)

X = all_data.loc[all_data.SalePrice>0, ['ConstructArea', 'TotalPoints', 'LotAreaMultSlope',  'GarageArea_x_Car']] 
poly(X, y)

As you can see, our third-degree polynomial with only Construct Area provides an improvement of 0.6%, while the one with Construct Area and Total Points has a significant improvement of 10.5% and final R2 of 85,6% in third-degree, but it represents only 0.6% again from without polynomials to a third-degree. We should be aware that adding more and more polynomial features increases the complexity of a model and therefore increases the chance of overfit. Se that our 4 features polynomials present a R2 of 87.2%, a increase of only 1.6% from the polynomial only with 2 features and it represent a increase of 0.9% in the gain (without and with poly).

If you have downloaded this notebook to run it, try to include some other variables with high correlation index, one at a time. You will find that some cause an opposite effect, reducing the accuracy of the model, for example change the TotalPoints by TotalExtraPoints. So, you can see that is better include a controlled polynomial interaction then just include the polynomial above all features in the pipeline.

Create Degree 3 Polynomials Features

As you saw, it is not appropriate to apply the polynomial to all of our variables, so let's create a code that applies the third-degree polynomial only in our previous 4 features and concatenate its result to our data set.

In [86]:
poly_cols = ['ConstructArea', 'TotalPoints', 'LotAreaMultSlope',  'GarageArea_x_Car']

pf = PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)
res = pf.fit_transform(all_data.loc[:, poly_cols])

target_feature_names = [feat.replace(' ','_') for feat in pf.get_feature_names(poly_cols)]
output_df = pd.DataFrame(res, columns = target_feature_names,  index=all_data.index).iloc[:, len(poly_cols):]
print('Polynomial Features included:', output_df.shape[1])
display(output_df.head())
all_data = pd.concat([all_data, output_df], axis=1)
print('Total Features after Polynomial Features included:', all_data.shape[1])
colsP = output_df.columns

del output_df, target_feature_names, res, pf
Polynomial Features included: 30
ConstructArea^2 ConstructArea_TotalPoints ConstructArea_LotAreaMultSlope ConstructArea_GarageArea_x_Car TotalPoints^2 TotalPoints_LotAreaMultSlope TotalPoints_GarageArea_x_Car LotAreaMultSlope^2 LotAreaMultSlope_GarageArea_x_Car GarageArea_x_Car^2 ... TotalPoints^3 TotalPoints^2_LotAreaMultSlope TotalPoints^2_GarageArea_x_Car TotalPoints_LotAreaMultSlope^2 TotalPoints_LotAreaMultSlope_GarageArea_x_Car TotalPoints_GarageArea_x_Car^2 LotAreaMultSlope^3 LotAreaMultSlope^2_GarageArea_x_Car LotAreaMultSlope_GarageArea_x_Car^2 GarageArea_x_Car^3
0 644.889 2310.915 1862.103 1501.104 8281.000 6672.716 5379.101 5376.782 4334.406 3494.111 ... 753571.000 607217.120 489498.214 489287.181 394430.910 317964.069 394260.863 317826.988 256211.062 206540.384
1 635.245 2117.141 1926.779 1364.704 7056.000 6421.560 4548.281 5844.166 4139.323 2931.812 ... 592704.000 539411.078 382055.640 490909.983 347703.144 246272.190 446769.860 316439.450 224128.650 158746.489
2 660.557 2441.625 2069.062 1600.247 9025.000 7647.893 5915.007 6480.916 5012.447 3876.710 ... 857375.000 726549.806 561925.652 615686.976 476182.503 368287.434 521740.490 403522.735 312091.166 241376.476
3 650.361 2346.201 1946.249 1996.519 8464.000 7021.162 7202.510 5824.280 5974.715 6129.035 ... 778688.000 645946.859 662630.937 535833.794 549673.775 563871.227 444491.447 455972.159 467749.404 479830.843
4 786.236 3112.429 2438.275 2502.183 12321.000 9652.265 9905.251 7561.580 7759.769 7963.153 ... 1367631.000 1071401.447 1099482.892 839335.362 861334.352 883909.936 657534.906 674768.904 692454.605 710603.849

5 rows × 30 columns

Total Features after Polynomial Features included: 257

Separate Train, Test Datasets, identifiers and Dependent Variable

In the next steps we will select features, reduce dimensions and run models, so it is important to separate our data sets again in training, test, id and dependent variable.

In [87]:
y_train = (all_data.SalePrice[all_data.SalePrice>0].reset_index(drop=True, inplace=False))

# Data with Polynomials
train = all_data.loc[(all_data.SalePrice>0), cols].reset_index(drop=True, inplace=False)
test = all_data.loc[(all_data.SalePrice==0), cols].reset_index(drop=True, inplace=False)

Select Features

It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models which can result in overfitting.

Compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.

Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease the accuracy of the model.

All of the features we find in the dataset might not be useful in building a machine learning model to make the necessary prediction. Using some of the features might even make the predictions worse. selectFeat Often in data science we have hundreds or even millions of features and we want a way to create a model that only includes the most important features. This has three benefits.

  1. It reduces the variance of the model, and therefore overfitting.
  2. It reduces the complexity of a model and makes it easier to interpret.
  3. It improves the accuracy of a model if the right subset is chosen. 4.Finally, it reduces the computational cost (and time) of training a model. So, an alternative way to reduce the complexity of the model and avoid overfitting is dimensionality reduction via feature selection, which is especially useful for unregularized models. There are two main categories of dimensionality reduction techniques: feature selection and feature extraction. Using feature selection, we select a subset of the original features. In feature extraction, we derive information from the feature set to construct a new feature subspace. image Exist various methodologies and techniques that you can use to subset your feature space and help your models perform better and efficiently. So, let's get started.

Prepare Data to Select Features

Let's create a data set, at first without applying our third-degree polymorph and already robust scaled.

In [88]:
scale = RobustScaler()
# Data without Polynomials
df = pd.DataFrame(scale.fit_transform(train[cols]), columns= cols)

Wrapper Methods

In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from your subset. The problem is essentially reduced to a search problem. image

The two main disadvantages of these methods are :

  • The increasing overfitting risk when the number of observations is insufficient.
  • These methods are usually computationally very expensive.

Backward Elimination

In backward elimination, we start with all the features and removes the least significant feature at each iteration which improves the performance of the model. We repeat this until no improvement is observed on removal of features.

We will see below row implementation of backward elimination, one to select by P-values and other based on the accuracy of a model the we submitted to it.

Backward Elimination By P-values

The P-value, or probability value, or asymptotic significance, is a probability value for a given statistical model that, if the null hypothesis is true, a set of statistical observations more commonly known as the statistical summary is greater than or equal in magnitude to the observed results.

The null hypothesis is a general statement that there is no relationship between two measured phenomena.

For example, if the correlation is very small and furthermore, the p-value is high meaning that it is very likely to observe such correlation on a dataset of this size purely by chance.

But you need to be careful how you interpret the statistical significance of a correlation. If your correlation coefficient has been determined to be statistically significant this does not mean that you have a strong association. It simply tests the null hypothesis that there is no relationship. By rejecting the null hypothesis, you accept the alternative hypothesis that states that there is a relationship, but with no information about the strength of the relationship or its importance.

Since removal of different features from the dataset will have different effects on the p-value for the dataset, we can remove different features and measure the p-value in each case. These measured p-values can be used to decide whether to keep a feature or not.

From statsmodels.api, the linear models allows estimation by ordinary least squares (OLS), weighted least squares (WLS), generalized least squares (GLS), and feasible generalized recursive least squares with autocorrelated AR(p) errors.

Next we make the test of a Linear regression to check the result and select features based on its the P-value:

In [89]:
ln_model=sm.OLS(y_train,df)
result=ln_model.fit()
print(result.summary2())
                    Results: Ordinary least squares
=======================================================================
Model:                 OLS               Adj. R-squared:      0.994    
Dependent Variable:    SalePrice         AIC:                 3961.2009
Date:                  2018-10-17 11:26  BIC:                 4953.5832
No. Observations:      1449              Log-Likelihood:      -1792.6  
Df Model:              188               F-statistic:         1390.    
Df Residuals:          1261              Prob (F-statistic):  0.00     
R-squared:             0.995             Scale:               0.79880  
-----------------------------------------------------------------------
                       Coef.  Std.Err.    t     P>|t|   [0.025   0.975]
-----------------------------------------------------------------------
LotFrontage            0.0392   0.0353   1.1124 0.2662  -0.0300  0.1084
Street                -1.6017   0.4800  -3.3368 0.0009  -2.5435 -0.6600
LotShape              -0.0556   0.0611  -0.9096 0.3632  -0.1754  0.0643
LandSlope              0.2857   0.1328   2.1523 0.0316   0.0253  0.5462
OverallCond            0.1716   0.0331   5.1850 0.0000   0.1067  0.2365
YearBuilt             -0.9777   0.1459  -6.6994 0.0000  -1.2640 -0.6914
YearRemodAdd           0.2852   0.0844   3.3785 0.0008   0.1196  0.4508
MasVnrArea            -0.3020   0.2279  -1.3251 0.1854  -0.7491  0.1451
ExterQual              0.0278   0.0824   0.3369 0.7363  -0.1339  0.1895
ExterCond             -0.4540   0.4179  -1.0864 0.2775  -1.2738  0.3658
BsmtQual              -0.3274   0.0635  -5.1559 0.0000  -0.4520 -0.2028
BsmtCond               0.0013   0.0037   0.3659 0.7145  -0.0059  0.0085
BsmtExposure           0.1349   0.0428   3.1517 0.0017   0.0509  0.2188
BsmtFinType1          -0.1680   0.0852  -1.9723 0.0488  -0.3352 -0.0009
BsmtFinType2          -0.2876   0.2585  -1.1127 0.2660  -0.7947  0.2195
HeatingQC             -1.0977   0.1076 -10.1991 0.0000  -1.3088 -0.8865
CentralAir            -0.2143   0.1515  -1.4145 0.1575  -0.5116  0.0829
BsmtFullBath           0.0651   0.1009   0.6446 0.5193  -0.1329  0.2630
BsmtHalfBath           0.3221   0.1240   2.5979 0.0095   0.0789  0.5654
FullBath               0.4932   0.3753   1.3140 0.1891  -0.2431  1.2296
HalfBath               0.1319   0.0792   1.6649 0.0962  -0.0235  0.2874
BedroomAbvGr           0.0626   0.0527   1.1888 0.2347  -0.0407  0.1659
KitchenAbvGr          28.8543   1.6502  17.4850 0.0000  25.6168 32.0918
KitchenQual           -0.1833   0.0776  -2.3614 0.0184  -0.3356 -0.0310
TotRmsAbvGrd           0.1318   0.0732   1.8010 0.0719  -0.0118  0.2754
Functional             0.0672   0.0420   1.6001 0.1098  -0.0152  0.1495
Fireplaces            -1.3732   0.1306 -10.5106 0.0000  -1.6295 -1.1169
GarageYrBlt            0.0640   0.0915   0.6999 0.4841  -0.1155  0.2435
GarageFinish          -0.1267   0.0459  -2.7598 0.0059  -0.2168 -0.0366
PavedDrive             0.0816   0.0638   1.2797 0.2009  -0.0435  0.2068
OpenPorchSF           -0.2650   0.0619  -4.2783 0.0000  -0.3865 -0.1435
EnclosedPorch          0.0017   0.0853   0.0202 0.9839  -0.1657  0.1691
Fence                 -5.0885  16.6280  -0.3060 0.7596 -37.7101 27.5331
MiscVal               -1.4590   2.4136  -0.6045 0.5456  -6.1942  3.2762
YrSold                 0.0487   0.0391   1.2471 0.2126  -0.0279  0.1254
TotalExtraPoints       1.8321   0.2041   8.9770 0.0000   1.4317  2.2325
TotalPoints           -0.0152   0.0894  -0.1700 0.8650  -0.1906  0.1602
GarageArea_x_Car      -0.0405   0.0711  -0.5693 0.5692  -0.1800  0.0990
ConstructArea          0.3809   0.0936   4.0705 0.0000   0.1973  0.5645
Garage_Newest         -0.0942   0.3422  -0.2752 0.7832  -0.7655  0.5772
LotAreaMultSlope       0.0465   0.0415   1.1203 0.2628  -0.0349  0.1279
TotBathrooms           0.1240   0.0414   2.9952 0.0028   0.0428  0.2052
MSSubClass_120         6.8899   0.2375  29.0077 0.0000   6.4240  7.3559
MSSubClass_160         6.4417   0.3007  21.4256 0.0000   5.8519  7.0316
MSSubClass_180         2.7049   0.4952   5.4622 0.0000   1.7334  3.6764
MSSubClass_190         4.7998   1.0071   4.7660 0.0000   2.8240  6.7756
MSSubClass_20          6.7573   0.1721  39.2585 0.0000   6.4196  7.0950
MSSubClass_30          6.8542   0.2464  27.8128 0.0000   6.3707  7.3376
MSSubClass_40          7.2201   0.6573  10.9848 0.0000   5.9306  8.5096
MSSubClass_50          5.9028   0.3461  17.0541 0.0000   5.2238  6.5819
MSSubClass_60          6.7731   0.1937  34.9674 0.0000   6.3931  7.1531
MSSubClass_70          6.9171   0.2528  27.3590 0.0000   6.4211  7.4131
MSSubClass_75          6.8677   0.4134  16.6136 0.0000   6.0567  7.6787
MSSubClass_80          3.4819   0.4290   8.1159 0.0000   2.6402  4.3236
MSSubClass_85          1.9362   0.3842   5.0389 0.0000   1.1823  2.6900
MSZoning_FV            0.3947   0.2628   1.5019 0.1334  -0.1209  0.9102
MSZoning_RH            0.0576   0.2618   0.2201 0.8258  -0.4559  0.5712
MSZoning_RM            0.2197   0.1367   1.6068 0.1084  -0.0485  0.4879
Alley_NA              -0.2432   0.1622  -1.4990 0.1341  -0.5614  0.0751
Alley_Pave            -0.0800   0.2403  -0.3331 0.7391  -0.5515  0.3914
LandContour_HLS       -0.2067   0.1993  -1.0376 0.2997  -0.5977  0.1842
LandContour_Low       -0.1929   0.2465  -0.7827 0.4339  -0.6764  0.2906
LandContour_Lvl       -0.2980   0.1434  -2.0776 0.0379  -0.5794 -0.0166
LotConfig_CulDSac     -0.0359   0.1288  -0.2784 0.7807  -0.2886  0.2168
LotConfig_FR2          0.0432   0.1560   0.2770 0.7818  -0.2629  0.3493
LotConfig_FR3         -0.6433   0.4923  -1.3068 0.1915  -1.6091  0.3225
LotConfig_Inside      -0.1042   0.0688  -1.5137 0.1304  -0.2392  0.0308
Neighborhood_Blmngtn   2.0054   0.3103   6.4630 0.0000   1.3967  2.6141
Neighborhood_Blueste   2.5723   0.7031   3.6586 0.0003   1.1930  3.9517
Neighborhood_BrDale    1.2022   0.3558   3.3789 0.0008   0.5042  1.9002
Neighborhood_BrkSide   1.7745   0.2280   7.7817 0.0000   1.3271  2.2219
Neighborhood_ClearCr   1.5609   0.2426   6.4349 0.0000   1.0850  2.0368
Neighborhood_CollgCr   1.8058   0.1613  11.1966 0.0000   1.4894  2.1222
Neighborhood_Crawfor   2.2026   0.2061  10.6855 0.0000   1.7982  2.6070
Neighborhood_Edwards   1.8617   0.1710  10.8883 0.0000   1.5263  2.1971
Neighborhood_Gilbert   1.8837   0.1798  10.4779 0.0000   1.5310  2.2364
Neighborhood_IDOTRR    1.8718   0.2658   7.0414 0.0000   1.3503  2.3934
Neighborhood_MeadowV   1.9667   0.3878   5.0718 0.0000   1.2059  2.7274
Neighborhood_Mitchel   2.1168   0.1828  11.5796 0.0000   1.7582  2.4755
Neighborhood_NAmes     1.7796   0.1368  13.0114 0.0000   1.5112  2.0479
Neighborhood_NPkVill   2.5897   0.4985   5.1950 0.0000   1.6117  3.5676
Neighborhood_NoRidge   1.5353   0.2115   7.2604 0.0000   1.1205  1.9502
Neighborhood_NridgHt   1.7708   0.2032   8.7159 0.0000   1.3722  2.1693
Neighborhood_OldTown   1.7123   0.2282   7.5048 0.0000   1.2647  2.1599
Neighborhood_SWISU     1.9300   0.2722   7.0896 0.0000   1.3959  2.4641
Neighborhood_Sawyer    1.8231   0.1633  11.1618 0.0000   1.5026  2.1435
Neighborhood_SawyerW   1.8344   0.1801  10.1875 0.0000   1.4811  2.1877
Neighborhood_Somerst   1.5377   0.2651   5.7998 0.0000   1.0175  2.0578
Neighborhood_StoneBr   2.0548   0.2550   8.0566 0.0000   1.5544  2.5552
Neighborhood_Timber    1.5075   0.2103   7.1676 0.0000   1.0949  1.9201
Neighborhood_Veenker   1.3761   0.3219   4.2748 0.0000   0.7446  2.0076
Condition1_Artery     -0.3935   0.2853  -1.3792 0.1681  -0.9533  0.1662
Condition1_Feedr      -0.2763   0.2648  -1.0434 0.2970  -0.7958  0.2432
Condition1_Norm       -0.3480   0.2380  -1.4620 0.1440  -0.8149  0.1190
Condition1_PosA       -0.5814   0.4077  -1.4261 0.1541  -1.3813  0.2184
Condition1_RRAe       -0.2191   0.3845  -0.5700 0.5688  -0.9734  0.5352
Condition1_RRAn        0.3512   0.3106   1.1306 0.2584  -0.2582  0.9606
Condition1_RRNe       -0.9747   0.7006  -1.3912 0.1644  -2.3492  0.3998
Condition1_RRNn        0.2323   0.5186   0.4480 0.6542  -0.7850  1.2497
Condition2_Artery     -4.8565   0.9515  -5.1042 0.0000  -6.7231 -2.9899
Condition2_Feedr      -0.9297   0.7141  -1.3019 0.1932  -2.3305  0.4712
Condition2_Norm       -0.8846   0.5414  -1.6339 0.1025  -1.9467  0.1775
Condition2_PosA       -0.1138   1.1541  -0.0986 0.9215  -2.3780  2.1504
Condition2_PosN       -1.6024   1.1040  -1.4515 0.1469  -3.7683  0.5634
BldgType_2fmCon       -0.3857   0.9797  -0.3937 0.6939  -2.3076  1.5363
BldgType_Twnhs        -0.2770   0.2039  -1.3585 0.1746  -0.6769  0.1230
HouseStyle_15Fin       0.7052   0.2860   2.4656 0.0138   0.1441  1.2664
HouseStyle_15Unf       6.1676   0.3271  18.8543 0.0000   5.5258  6.8093
HouseStyle_25Unf      -0.6267   0.4387  -1.4285 0.1534  -1.4873  0.2340
HouseStyle_SFoyer      4.4999   0.3489  12.8969 0.0000   3.8154  5.1845
HouseStyle_SLvl        3.1980   0.4180   7.6511 0.0000   2.3780  4.0180
RoofStyle_Gambrel     -0.8628   0.3155  -2.7348 0.0063  -1.4817 -0.2439
RoofStyle_Hip         -0.0084   0.0709  -0.1190 0.9053  -0.1476  0.1307
RoofStyle_Mansard      0.6045   0.4200   1.4391 0.1504  -0.2196  1.4285
RoofStyle_Shed         1.5284   0.9577   1.5960 0.1107  -0.3503  3.4072
RoofMatl_CompShg      -2.4608   0.5729  -4.2953 0.0000  -3.5848 -1.3369
RoofMatl_TarGrv       -1.6400   0.6298  -2.6040 0.0093  -2.8755 -0.4044
RoofMatl_WdShake      -3.3259   0.7656  -4.3442 0.0000  -4.8279 -1.8239
RoofMatl_WdShngl      -2.3756   0.6994  -3.3965 0.0007  -3.7477 -1.0034
Exterior1st_AsbShng   -0.9093   0.4824  -1.8848 0.0597  -1.8557  0.0372
Exterior1st_AsphShn    1.8023   1.1940   1.5095 0.1314  -0.5401  4.1447
Exterior1st_BrkComm   -0.6524   0.9945  -0.6560 0.5119  -2.6034  1.2986
Exterior1st_BrkFace   -0.1617   0.2736  -0.5910 0.5546  -0.6985  0.3751
Exterior1st_CemntBd   -0.5950   0.6027  -0.9872 0.3237  -1.7775  0.5875
Exterior1st_HdBoard   -0.5452   0.2380  -2.2910 0.0221  -1.0121 -0.0783
Exterior1st_Plywood    0.0020   0.2514   0.0081 0.9935  -0.4912  0.4953
Exterior1st_Stucco     0.5955   0.3661   1.6266 0.1041  -0.1227  1.3137
Exterior1st_WdSdng    -0.3393   0.2281  -1.4874 0.1372  -0.7868  0.1082
Exterior1st_WdShing    0.5432   0.2896   1.8757 0.0609  -0.0249  1.1114
Exterior2nd_AsbShng    0.9187   0.4682   1.9621 0.0500   0.0001  1.8373
Exterior2nd_AsphShn    0.5424   0.7220   0.7512 0.4527  -0.8741  1.9590
Exterior2nd_BrkCmn    -0.0520   0.7077  -0.0735 0.9415  -1.4404  1.3364
Exterior2nd_BrkFace    0.9922   0.3324   2.9848 0.0029   0.3400  1.6443
Exterior2nd_CmentBd    0.7936   0.6071   1.3070 0.1914  -0.3976  1.9847
Exterior2nd_HdBoard    0.9903   0.2414   4.1026 0.0000   0.5167  1.4638
Exterior2nd_ImStucc    0.6777   0.3550   1.9091 0.0565  -0.0187  1.3741
Exterior2nd_MetalSd    0.4087   0.0987   4.1414 0.0000   0.2151  0.6023
Exterior2nd_Plywood    0.8364   0.2411   3.4698 0.0005   0.3635  1.3093
Exterior2nd_Stone      0.9148   0.4860   1.8825 0.0600  -0.0386  1.8682
Exterior2nd_Stucco    -0.4419   0.3725  -1.1864 0.2357  -1.1727  0.2889
Exterior2nd_WdSdng     0.7006   0.2362   2.9658 0.0031   0.2371  1.1640
Exterior2nd_WdShng     0.1959   0.2592   0.7559 0.4499  -0.3125  0.7043
MasVnrType_BrkFace     0.7093   0.2293   3.0937 0.0020   0.2595  1.1592
MasVnrType_Stone       0.6706   0.2458   2.7287 0.0064   0.1885  1.1527
Foundation_BrkTil     -0.0239   0.1216  -0.1968 0.8441  -0.2624  0.2145
Foundation_PConc       0.1183   0.0937   1.2634 0.2067  -0.0654  0.3021
Foundation_Slab        0.8420   0.2952   2.8522 0.0044   0.2628  1.4211
Foundation_Stone      -0.3689   0.4235  -0.8710 0.3839  -1.1997  0.4619
Foundation_Wood        0.2868   0.5571   0.5148 0.6068  -0.8061  1.3796
Heating_GasA           0.0376   0.5696   0.0659 0.9474  -1.0799  1.1550
Heating_GasW           0.3365   0.6065   0.5548 0.5791  -0.8533  1.5263
Heating_Grav          -0.6319   0.6734  -0.9383 0.3483  -1.9530  0.6893
Heating_Wall           0.9190   0.7926   1.1595 0.2465  -0.6359  2.4739
Electrical_FuseA       0.0597   0.1148   0.5199 0.6032  -0.1655  0.2849
Electrical_FuseF      -0.5839   0.2129  -2.7431 0.0062  -1.0015 -0.1663
Electrical_FuseP      -0.9416   0.6362  -1.4800 0.1391  -2.1899  0.3066
GarageType_Basment    -0.0797   0.2513  -0.3169 0.7513  -0.5727  0.4134
GarageType_BuiltIn    -0.0359   0.1198  -0.2997 0.7644  -0.2709  0.1991
GarageType_CarPort     1.6014   0.3574   4.4813 0.0000   0.9003  2.3025
GarageType_Detchd      0.3575   0.0864   4.1390 0.0000   0.1880  0.5269
GarageType_NA          0.7564   0.4207   1.7980 0.0724  -0.0689  1.5817
MiscFeature_Othr      -1.3952   0.8546  -1.6326 0.1028  -3.0717  0.2814
MiscFeature_Pool      -1.0293   0.7515  -1.3696 0.1711  -2.5037  0.4451
MiscFeature_Shed       0.5269   0.6317   0.8342 0.4043  -0.7123  1.7662
MoSold_1               1.1259   0.1621   6.9443 0.0000   0.8078  1.4440
MoSold_11              1.0230   0.1464   6.9897 0.0000   0.7358  1.3101
MoSold_12              1.0071   0.1586   6.3499 0.0000   0.6960  1.3183
MoSold_2               1.0344   0.1639   6.3118 0.0000   0.7129  1.3560
MoSold_3               0.8904   0.1381   6.4469 0.0000   0.6194  1.1613
MoSold_4               0.9953   0.1285   7.7439 0.0000   0.7432  1.2475
MoSold_5               1.0382   0.1207   8.6031 0.0000   0.8014  1.2749
MoSold_6               1.1108   0.1167   9.5156 0.0000   0.8818  1.3398
MoSold_7               1.0463   0.1162   9.0068 0.0000   0.8184  1.2742
MoSold_8               1.0505   0.1299   8.0894 0.0000   0.7957  1.3053
MoSold_9               1.0452   0.1565   6.6801 0.0000   0.7382  1.3521
SaleType_CWD          -0.2217   0.4835  -0.4585 0.6466  -1.1702  0.7268
SaleType_Con           0.3509   0.6780   0.5175 0.6049  -0.9793  1.6810
SaleType_ConLD        -0.3335   0.3663  -0.9105 0.3627  -1.0522  0.3852
SaleType_ConLI         0.2263   0.4325   0.5232 0.6009  -0.6223  1.0749
SaleType_ConLw         0.4022   0.4458   0.9023 0.3671  -0.4724  1.2768
SaleType_Oth           0.1594   0.6000   0.2657 0.7905  -1.0176  1.3365
SaleType_WD            0.0061   0.1045   0.0584 0.9535  -0.1989  0.2111
SaleCondition_Abnorml  0.0206   0.1057   0.1948 0.8456  -0.1867  0.2279
SaleCondition_AdjLand  4.3946   0.5740   7.6560 0.0000   3.2684  5.5207
SaleCondition_Alloca  -0.9173   0.3367  -2.7244 0.0065  -1.5778 -0.2567
SaleCondition_Family   0.3557   0.2142   1.6609 0.0970  -0.0644  0.7758
Remod                  0.2063   0.0692   2.9804 0.0029   0.0705  0.3420
IsNew                  0.1244   0.1623   0.7666 0.4435  -0.1940  0.4428
-----------------------------------------------------------------------
Omnibus:               274.775        Durbin-Watson:           2.030   
Prob(Omnibus):         0.000          Jarque-Bera (JB):        3355.726
Skew:                  0.502          Prob(JB):                0.000   
Kurtosis:              10.388         Condition No.:           6023    
=======================================================================
* The condition number is large (6e+03). This might indicate
strong multicollinearity or other numerical problems.

Like before, I excluded one by one of the features with the highest P-value and run again until get only P-values up to 0.05, but here I use a backward elimination process.

In [90]:
pv_cols = cols.values

def backwardElimination(x, Y, sl, columns):
    ini = len(columns)
    numVars = x.shape[1]
    for i in range(0, numVars):
        regressor = sm.OLS(Y, x).fit()
        maxVar = max(regressor.pvalues) #.astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor.pvalues[j].astype(float) == maxVar):
                    columns = np.delete(columns, j)
                    x = x.loc[:, columns]
                    
    print('\nSelect {:d} features from {:d} by best p-values.'.format(len(columns), ini))
    print('The max p-value from the features selecte is {:.3f}.'.format(maxVar))
    print(regressor.summary())
    
    # odds ratios and 95% CI
    conf = np.exp(regressor.conf_int())
    conf['Odds Ratios'] = np.exp(regressor.params)
    conf.columns = ['2.5%', '97.5%', 'Odds Ratios']
    display(conf)
    
    return columns, regressor

SL = 0.051

pv_cols, LR = backwardElimination(df, y_train, SL, pv_cols)
Select 107 features from 188 by best p-values.
The max p-value from the features selecte is 0.050.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     2454.
Date:                Wed, 17 Oct 2018   Prob (F-statistic):               0.00
Time:                        11:26:31   Log-Likelihood:                -1833.9
No. Observations:                1449   AIC:                             3882.
Df Residuals:                    1342   BIC:                             4447.
Df Model:                         107                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Street                   -1.8422      0.441     -4.179      0.000      -2.707      -0.977
LandSlope                 0.2579      0.105      2.444      0.015       0.051       0.465
OverallCond               0.1762      0.029      6.108      0.000       0.120       0.233
YearBuilt                -1.0450      0.121     -8.653      0.000      -1.282      -0.808
YearRemodAdd              0.3465      0.078      4.432      0.000       0.193       0.500
BsmtQual                 -0.3040      0.058     -5.224      0.000      -0.418      -0.190
BsmtExposure              0.1417      0.040      3.571      0.000       0.064       0.219
BsmtFinType1             -0.1729      0.075     -2.290      0.022      -0.321      -0.025
HeatingQC                -1.1701      0.102    -11.509      0.000      -1.370      -0.971
BsmtHalfBath              0.2708      0.109      2.486      0.013       0.057       0.484
KitchenAbvGr             28.0325      1.499     18.696      0.000      25.091      30.974
KitchenQual              -0.1777      0.073     -2.432      0.015      -0.321      -0.034
TotRmsAbvGrd              0.2062      0.061      3.371      0.001       0.086       0.326
Functional                0.0809      0.040      2.043      0.041       0.003       0.159
Fireplaces               -1.4306      0.126    -11.388      0.000      -1.677      -1.184
GarageFinish             -0.1349      0.044     -3.071      0.002      -0.221      -0.049
OpenPorchSF              -0.2296      0.058     -3.935      0.000      -0.344      -0.115
TotalExtraPoints          1.8834      0.196      9.622      0.000       1.499       2.267
ConstructArea             0.3645      0.074      4.898      0.000       0.219       0.511
TotBathrooms              0.1495      0.027      5.521      0.000       0.096       0.203
MSSubClass_120            6.8068      0.201     33.837      0.000       6.412       7.201
MSSubClass_160            6.2605      0.238     26.292      0.000       5.793       6.728
MSSubClass_180            2.4669      0.445      5.548      0.000       1.595       3.339
MSSubClass_190            4.4388      0.255     17.440      0.000       3.939       4.938
MSSubClass_20             6.6830      0.151     44.134      0.000       6.386       6.980
MSSubClass_30             6.8001      0.218     31.187      0.000       6.372       7.228
MSSubClass_40             7.1887      0.565     12.729      0.000       6.081       8.297
MSSubClass_50             5.9983      0.318     18.855      0.000       5.374       6.622
MSSubClass_60             6.7731      0.163     41.548      0.000       6.453       7.093
MSSubClass_70             6.9452      0.228     30.498      0.000       6.499       7.392
MSSubClass_75             6.6864      0.307     21.768      0.000       6.084       7.289
MSSubClass_80             3.5720      0.413      8.641      0.000       2.761       4.383
MSSubClass_85             1.9117      0.359      5.323      0.000       1.207       2.616
LotConfig_Inside         -0.1160      0.055     -2.111      0.035      -0.224      -0.008
Neighborhood_Blmngtn      1.9140      0.288      6.639      0.000       1.348       2.480
Neighborhood_Blueste      2.3878      0.672      3.551      0.000       1.069       3.707
Neighborhood_BrDale       1.1992      0.309      3.877      0.000       0.592       1.806
Neighborhood_BrkSide      1.9969      0.196     10.187      0.000       1.612       2.381
Neighborhood_ClearCr      1.6838      0.223      7.549      0.000       1.246       2.121
Neighborhood_CollgCr      1.8782      0.148     12.671      0.000       1.587       2.169
Neighborhood_Crawfor      2.2557      0.190     11.859      0.000       1.883       2.629
Neighborhood_Edwards      1.9355      0.158     12.235      0.000       1.625       2.246
Neighborhood_Gilbert      2.0473      0.164     12.516      0.000       1.726       2.368
Neighborhood_IDOTRR       2.1374      0.221      9.654      0.000       1.703       2.572
Neighborhood_MeadowV      2.2913      0.319      7.183      0.000       1.666       2.917
Neighborhood_Mitchel      2.2565      0.172     13.140      0.000       1.920       2.593
Neighborhood_NAmes        1.8678      0.128     14.625      0.000       1.617       2.118
Neighborhood_NPkVill      2.3699      0.345      6.872      0.000       1.693       3.046
Neighborhood_NoRidge      1.6534      0.192      8.617      0.000       1.277       2.030
Neighborhood_NridgHt      1.7619      0.183      9.623      0.000       1.403       2.121
Neighborhood_OldTown      1.9496      0.178     10.944      0.000       1.600       2.299
Neighborhood_SWISU        2.1487      0.243      8.842      0.000       1.672       2.625
Neighborhood_Sawyer       1.9328      0.153     12.660      0.000       1.633       2.232
Neighborhood_SawyerW      1.8904      0.165     11.490      0.000       1.568       2.213
Neighborhood_Somerst      1.9352      0.173     11.195      0.000       1.596       2.274
Neighborhood_StoneBr      2.1167      0.234      9.054      0.000       1.658       2.575
Neighborhood_Timber       1.6339      0.195      8.381      0.000       1.251       2.016
Neighborhood_Veenker      1.4596      0.303      4.818      0.000       0.865       2.054
Condition1_RRAn           0.6304      0.190      3.323      0.001       0.258       1.003
Condition2_Artery        -3.8490      0.720     -5.348      0.000      -5.261      -2.437
HouseStyle_15Fin          0.5951      0.270      2.204      0.028       0.065       1.125
HouseStyle_15Unf          6.0276      0.301     20.002      0.000       5.436       6.619
HouseStyle_SFoyer         4.3510      0.320     13.601      0.000       3.723       4.979
HouseStyle_SLvl           3.0246      0.399      7.586      0.000       2.242       3.807
RoofStyle_Gambrel        -0.7991      0.289     -2.762      0.006      -1.366      -0.232
RoofStyle_Mansard         0.8403      0.380      2.212      0.027       0.095       1.586
RoofStyle_Shed            1.8170      0.749      2.427      0.015       0.348       3.286
RoofMatl_CompShg         -2.6080      0.555     -4.696      0.000      -3.698      -1.518
RoofMatl_TarGrv          -1.8215      0.611     -2.980      0.003      -3.020      -0.623
RoofMatl_WdShake         -3.7890      0.727     -5.209      0.000      -5.216      -2.362
RoofMatl_WdShngl         -2.3945      0.672     -3.564      0.000      -3.713      -1.076
Exterior1st_AsbShng      -0.8978      0.414     -2.167      0.030      -1.711      -0.085
Exterior1st_AsphShn       2.6022      0.927      2.807      0.005       0.783       4.421
Exterior1st_HdBoard      -0.4252      0.155     -2.739      0.006      -0.730      -0.121
Exterior1st_Stucco        0.4300      0.207      2.075      0.038       0.024       0.836
Exterior1st_WdShing       0.6659      0.193      3.443      0.001       0.286       1.045
Exterior2nd_AsbShng       0.8548      0.411      2.079      0.038       0.048       1.661
Exterior2nd_BrkFace       0.9753      0.201      4.842      0.000       0.580       1.370
Exterior2nd_HdBoard       0.8523      0.164      5.203      0.000       0.531       1.174
Exterior2nd_MetalSd       0.4280      0.089      4.791      0.000       0.253       0.603
Exterior2nd_Plywood       0.7569      0.106      7.111      0.000       0.548       0.966
Exterior2nd_Stone         1.1400      0.430      2.650      0.008       0.296       1.984
Exterior2nd_WdSdng        0.4323      0.095      4.557      0.000       0.246       0.618
MasVnrType_BrkFace        0.4143      0.067      6.143      0.000       0.282       0.547
MasVnrType_Stone          0.3483      0.108      3.240      0.001       0.137       0.559
Foundation_Slab           1.1460      0.244      4.692      0.000       0.667       1.625
Electrical_FuseF         -0.4777      0.194     -2.463      0.014      -0.858      -0.097
GarageType_CarPort        1.4772      0.336      4.397      0.000       0.818       2.136
GarageType_Detchd         0.3469      0.080      4.325      0.000       0.190       0.504
GarageType_NA             0.5887      0.150      3.919      0.000       0.294       0.883
MiscFeature_Othr         -1.6040      0.727     -2.207      0.027      -3.030      -0.178
MiscFeature_Pool         -1.2589      0.406     -3.102      0.002      -2.055      -0.463
MoSold_1                  1.1578      0.156      7.408      0.000       0.851       1.464
MoSold_11                 1.1267      0.141      7.990      0.000       0.850       1.403
MoSold_12                 1.0830      0.153      7.060      0.000       0.782       1.384
MoSold_2                  1.0676      0.158      6.761      0.000       0.758       1.377
MoSold_3                  0.9900      0.132      7.494      0.000       0.731       1.249
MoSold_4                  1.0419      0.123      8.456      0.000       0.800       1.284
MoSold_5                  1.1347      0.115      9.852      0.000       0.909       1.361
MoSold_6                  1.2030      0.110     10.892      0.000       0.986       1.420
MoSold_7                  1.1190      0.112     10.026      0.000       0.900       1.338
MoSold_8                  1.1502      0.126      9.110      0.000       0.903       1.398
MoSold_9                  1.1705      0.151      7.739      0.000       0.874       1.467
SaleCondition_AdjLand     4.1419      0.524      7.901      0.000       3.114       5.170
SaleCondition_Alloca     -1.0346      0.314     -3.296      0.001      -1.650      -0.419
SaleCondition_Family      0.4078      0.208      1.964      0.050       0.000       0.815
Remod                     0.1894      0.064      2.978      0.003       0.065       0.314
==============================================================================
Omnibus:                      277.274   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3343.331
Skew:                           0.516   Prob(JB):                         0.00
Kurtosis:                      10.370   Cond. No.                         141.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
2.5% 97.5% Odds Ratios
Street 0.067 0.376 0.158
LandSlope 1.052 1.592 1.294
OverallCond 1.127 1.262 1.193
YearBuilt 0.277 0.446 0.352
YearRemodAdd 1.213 1.649 1.414
BsmtQual 0.658 0.827 0.738
BsmtExposure 1.066 1.245 1.152
BsmtFinType1 0.725 0.976 0.841
HeatingQC 0.254 0.379 0.310
BsmtHalfBath 1.059 1.623 1.311
KitchenAbvGr 78865047671.961 28300788513415.922 1493968886980.128
KitchenQual 0.725 0.966 0.837
TotRmsAbvGrd 1.090 1.386 1.229
Functional 1.003 1.172 1.084
Fireplaces 0.187 0.306 0.239
GarageFinish 0.802 0.952 0.874
OpenPorchSF 0.709 0.891 0.795
TotalExtraPoints 4.479 9.655 6.576
ConstructArea 1.244 1.666 1.440
TotBathrooms 1.101 1.225 1.161
MSSubClass_120 609.208 1341.322 903.960
MSSubClass_160 328.118 835.152 523.477
MSSubClass_180 4.926 28.196 11.786
MSSubClass_190 51.391 139.499 84.670
MSSubClass_20 593.464 1075.016 798.739
MSSubClass_30 585.422 1377.206 897.913
MSSubClass_40 437.381 4010.232 1324.387
MSSubClass_50 215.770 751.738 402.743
MSSubClass_60 634.780 1203.359 873.995
MSSubClass_70 664.146 1622.929 1038.201
... ... ... ...
Exterior2nd_BrkFace 1.786 3.937 2.652
Exterior2nd_HdBoard 1.701 3.234 2.345
Exterior2nd_MetalSd 1.288 1.828 1.534
Exterior2nd_Plywood 1.730 2.627 2.132
Exterior2nd_Stone 1.345 7.270 3.127
Exterior2nd_WdSdng 1.279 1.856 1.541
MasVnrType_BrkFace 1.326 1.727 1.513
MasVnrType_Stone 1.147 1.749 1.417
Foundation_Slab 1.948 5.079 3.146
Electrical_FuseF 0.424 0.907 0.620
GarageType_CarPort 2.266 8.468 4.381
GarageType_Detchd 1.209 1.656 1.415
GarageType_NA 1.342 2.419 1.802
MiscFeature_Othr 0.048 0.837 0.201
MiscFeature_Pool 0.128 0.630 0.284
MoSold_1 2.343 4.325 3.183
MoSold_11 2.340 4.069 3.085
MoSold_12 2.186 3.991 2.954
MoSold_2 2.134 3.965 2.908
MoSold_3 2.077 3.487 2.691
MoSold_4 2.226 3.610 2.835
MoSold_5 2.481 3.899 3.110
MoSold_6 2.681 4.136 3.330
MoSold_7 2.460 3.811 3.062
MoSold_8 2.466 4.047 3.159
MoSold_9 2.396 4.337 3.223
SaleCondition_AdjLand 22.501 175.961 62.923
SaleCondition_Alloca 0.192 0.658 0.355
SaleCondition_Family 1.000 2.260 1.504
Remod 1.067 1.369 1.209

107 rows × 3 columns

From the results, we can highlight:

  • we're very confident about some relationship between the probability of raising prices:
    • there is an sgnificante inverse relationship with Fireplaces and Roof Material CompShg.
    • there is an positive relationship, from greater to low, with HalfBath, BsmtFinType2, BldgType, Foundation, MasVnrType, YearRemodAdd, Fence, BsmtHalfBath, HouseStyle, ExterCond, Exterior2nd, FullBath, TotalExtraPoints, BsmtFullBath, Exterior1st, KitchenAbvGr, Neighborhood image
  • From the coefficient:
    • As expected, the neighborhood makes a lot of difference, which is confirmed by the presence of all 24 dummy dummies after only one exclusion by the FIV.
    • From MSSubClass category we can see that thirteen of them are among the first seventeen highest coefficients.
    • Is interest to note that KitchenAbvGr is the most highest coefficient, you can imagine this before?
    • Note that from our chosen polynomial variables were selected only construction area and Total Extra Points.

Take the exponential of each of the coefficients to generate the odds ratios. This tells you how a 1 unit increase or decrease in a variable affects the odds of raising prices. For example, we can expect the odds of price to decreases n 18.5% with the numbers of Fireplaces.

image

Let's take a look at the graphs of some of the interactions of the selected features:

In [91]:
pred = LR.predict(df[pv_cols])

df_copy['proba'] = pred

y_pred = pred.apply(lambda x: 1 if x > 0.5 else 0)

print('Fvalue: {:.6f}'.format(LR.fvalue))
print('MSE total on the train data: {:.4f}'.format(LR.mse_total))
def plot_proba(continous, predict, discret, data):
    grouped = pd.pivot_table(data, values=[predict], index=[continous, discret], aggfunc=np.mean)
    colors = 'rbgyrbgy'
    for col in data[discret].unique():
        plt_data = grouped.ix[grouped.index.get_level_values(1)==col]
        plt.plot(plt_data.index.get_level_values(0), plt_data[predict], color=colors[int(col)])
    plt.xlabel(continous)
    plt.ylabel("Probabilities")
    plt.legend(np.sort(data[discret].unique()), loc='upper left', title=discret)
    plt.title("Probabilities with " + continous + " and " + discret)

fig = plt.figure(figsize=(30, 20))
ax = fig.add_subplot(241); plot_proba('ConstructArea', 'SalePrice', 'Remod', df_copy)
ax = fig.add_subplot(242); plot_proba('TotalPoints', 'SalePrice', 'ExterCond', df_copy)
ax = fig.add_subplot(243); plot_proba('BsmtFinType1', 'SalePrice', 'BsmtCond', df_copy)
ax = fig.add_subplot(245); plot_proba('TotalExtraPoints', 'SalePrice', 'LotShape', df_copy)

plt.show()
del df_copy
Fvalue: 2453.619954
MSE total on the train data: 144.7138

As expected, we can see that prices grow with the growth of the built area, although the reform does not seem to contribute higher prices, in fact we have to remember that if a house went through renovation it is indeed old enough to have needed, and New homes tend to be more expensive. If you wish, switch from Remod to IsNew and see for yourself.

Something similar can be seen in relation to the total points segmented by the external condition, while we see that prices grow with the growth of the points, we see that although the external condition presents a small positive coefficient, the graph may be suggesting something different, but note that level 3 stands out at the beginning and around the mean, which would explain a small positive coefficient.

Also see that more important than basement conditions is its purpose in itself. Basements with living conditions present higher prices, curiously unfinished ones too, perhaps because they get the new owners to make them what they want.

As for the lot multiplied by the slope, as we already know we see the trend of price increase with lot size, but much variation, since other aspects influence the price as well as the slope itself.

Finally, the total of extra points, there is nothing new when we see that the bigger the better, segmented by the format of the lot, we see that the more regular the better it is, but if the terrain is unregulated the extra high score will not work.

This is the beauty of linear models, even with many features it is possible to understand them when evaluating their coefficients, significance and graphs , as we can see how certain variables present noise and its influence on variance and bias.

Select Features by Recursive Feature Elimination

The goal of Recursive Feature Elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.

RFE is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted.

Other option is sequential Feature Selector (SFS) from mlxtend, a separate Python library that is designed to work well with scikit-learn, also provides a S that works a bit differently.

RFE is computationally less complex using the feature's weight coefficients (e.g., linear models) or feature importances (tree-based algorithms) to eliminate features recursively, whereas SFSs eliminate (or add) features based on a user-defined classifier/regression performance metric.

The scikit-learn has two implementations of RFE, let's see the RFECV through a Lasso model to make the feature ranking with recursive feature elimination and cross-validated selection of the best number of features.

In fact, this algorithm is quite efficient and makes a selection that will produce the best results when we go through the hyper parametrization phase.

In [92]:
ls = Lasso(alpha = 0.0005, max_iter = 161, selection = 'cyclic', tol = 0.002, random_state = 101)
rfecv = RFECV(estimator=ls, n_jobs = -1, step=1, scoring = 'neg_mean_squared_error' ,cv=5)
rfecv.fit(df, y_train)

select_features_rfecv = rfecv.get_support()
RFEcv = cols[select_features_rfecv]
print('{:d} Features Select by RFEcv:\n{:}'.format(rfecv.n_features_, RFEcv.values))
79 Features Select by RFEcv:
['LotFrontage' 'LandSlope' 'OverallCond' 'YearBuilt' 'YearRemodAdd'
 'MasVnrArea' 'BsmtQual' 'BsmtCond' 'BsmtFinType1' 'BsmtFinType2'
 'HeatingQC' 'CentralAir' 'BsmtHalfBath' 'HalfBath' 'BedroomAbvGr'
 'KitchenQual' 'TotRmsAbvGrd' 'Functional' 'Fireplaces' 'GarageFinish'
 'PavedDrive' 'OpenPorchSF' 'EnclosedPorch' 'YrSold' 'TotalPoints'
 'GarageArea_x_Car' 'ConstructArea' 'LotAreaMultSlope' 'TotBathrooms'
 'MSSubClass_160' 'MSSubClass_20' 'MSSubClass_30' 'MSSubClass_50'
 'MSSubClass_70' 'MSZoning_FV' 'MSZoning_RM' 'Alley_Pave'
 'LandContour_Lvl' 'LotConfig_CulDSac' 'LotConfig_FR2'
 'Neighborhood_BrkSide' 'Neighborhood_ClearCr' 'Neighborhood_CollgCr'
 'Neighborhood_Crawfor' 'Neighborhood_Edwards' 'Neighborhood_IDOTRR'
 'Neighborhood_MeadowV' 'Neighborhood_Mitchel' 'Neighborhood_NoRidge'
 'Neighborhood_NridgHt' 'Neighborhood_OldTown' 'Neighborhood_Somerst'
 'Neighborhood_StoneBr' 'Condition1_Artery' 'Condition1_Norm'
 'Condition1_RRAe' 'RoofStyle_Hip' 'Exterior1st_BrkFace'
 'Exterior1st_HdBoard' 'Exterior1st_Plywood' 'Exterior1st_WdSdng'
 'Exterior2nd_HdBoard' 'Exterior2nd_MetalSd' 'Exterior2nd_Plywood'
 'MasVnrType_Stone' 'Foundation_BrkTil' 'Foundation_PConc'
 'Foundation_Slab' 'Heating_GasW' 'GarageType_BuiltIn' 'GarageType_NA'
 'MiscFeature_Shed' 'MoSold_11' 'MoSold_3' 'MoSold_5' 'MoSold_7'
 'SaleType_WD' 'SaleCondition_Abnorml' 'SaleCondition_Family']

Sequential feature selection

Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features that are most relevant to the problem to improve computational efficiency or reduce the generalization error of the model by removing irrelevant features or noise, which can be useful for algorithms that don't support regularization.

Greedy algorithms make locally optimal choices at each stage of a combinatorial search problem and generally yield a suboptimal solution to the problem in contrast to exhaustive search algorithms, which evaluate all possible combinations and are guaranteed to find the optimal solution. However, in practice, an exhaustive search is often computationally not feasible, whereas greedy algorithms allow for a less complex, computationally more efficient solution.

As you saw in the previous topic, RFE is computationally less complex using the feature weight coefficients (e.g., linear models) or feature importance (tree-based algorithms) to eliminate features recursively, whereas SFSs eliminate (or add) features based on a user-defined classifier/regression performance metric.

The SBS aims to reduce the dimensionality of the initial feature subspace with a minimum decay in performance of the regressor or classifier to improve upon computational efficiency. In certain cases, SBS can even improve the predictive power of the model if a model suffers from overfitting.

SBS sequentially removes features from the full feature subset until the new feature subspace contains the desired number of features. In order to determine which feature is to be removed at each stage, we need to define criterion function J that we want to minimize. The criterion calculated by the criterion function can simply be the difference in performance of the classifier after and before the removal of a particular feature. Then the feature to be removed at each stage can simply be defined as the feature that maximizes this criterion.

So, let's see a example of SBS in our data,

In [93]:
from itertools import combinations
class SBS():
    def __init__(self, estimator, k_features, scoring=r2_score, test_size=0.25, random_state=101):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state)
        dim = X_train.shape[1]
        self.indices_ = list(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, X_test, y_test, self.indices_)
        self.scores_ = [score]
        
        while dim > self.k_features:
            scores = []
            subsets = []
            for p in combinations(self.indices_, r=dim-1):
                score = self._calc_score(X_train, y_train, X_test, y_test, list(p))
                scores.append(score)
                subsets.append(list(p))
                
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1
            self.scores_.append(scores[best])
            
        self.k_score_ = self.scores_[-1]
        return self

    def transform(self, X):
        return X.iloc[:, self.indices_]
    
    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train.iloc[:, indices], y_train)
        y_pred = self.estimator.predict(X_test.iloc[:, indices])
        score = self.scoring(y_test, y_pred)
        return score

score = r2_score
ls = Lasso(alpha = 0.0005, max_iter = 161, selection = 'cyclic', tol = 0.002, random_state = 101)
sbs = SBS(ls, k_features=1, scoring= score)
sbs.fit(df, y_train)

k_feat = [len(k) for k in sbs.subsets_]
fig = plt.figure(figsize=(10,5))
plt.plot(k_feat, sbs.scores_, marker='o')
plt.xlim([1, len(sbs.subsets_)])
plt.xticks(np.arange(1, len(sbs.subsets_)+1))
plt.ylabel('R2 Score')
plt.xlabel('Number of features')
plt.grid(b=1)
plt.show()

print('Best Score: {:2.2%}\n'.format(max(sbs.scores_)))
print('Best score with:{0:2d}.\n'.\
      format(len(list(df.columns[sbs.subsets_[np.argmax(sbs.scores_)]]))))
SBS = list(df.columns[list(sbs.subsets_[max(np.arange(0, len(sbs.scores_))[(sbs.scores_==max(sbs.scores_))])])])
print('\nBest score with {0:2d} features:\n{1:}'.format(len(SBS), SBS))
Best Score: 93.18%

Best score with:52.


Best score with 52 features:
['OverallCond', 'MasVnrArea', 'BsmtCond', 'BsmtFinType2', 'CentralAir', 'HalfBath', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageFinish', 'PavedDrive', 'EnclosedPorch', 'YrSold', 'TotalPoints', 'ConstructArea', 'LotAreaMultSlope', 'TotBathrooms', 'MSSubClass_160', 'MSSubClass_20', 'MSSubClass_60', 'LandContour_Lvl', 'LotConfig_CulDSac', 'LotConfig_FR2', 'Neighborhood_BrkSide', 'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Condition1_Norm', 'Condition1_RRAe', 'HouseStyle_SFoyer', 'HouseStyle_SLvl', 'RoofMatl_CompShg', 'Exterior1st_WdSdng', 'Exterior2nd_HdBoard', 'Exterior2nd_Plywood', 'Foundation_BrkTil', 'Foundation_PConc', 'Foundation_Slab', 'Heating_GasW', 'GarageType_CarPort', 'MoSold_11', 'MoSold_3', 'MoSold_5', 'SaleCondition_Family', 'IsNew']

As you saw, the SBS is straightforward code to understand, but is computationally expensive. In a nutshell, SFAs remove or add one feature at the time based on the classifier or regressior performance until a feature subset of the desired size k is reached. There are 4 different flavors of SFAs available via the SequentialFeatureSelector from mlxtend:

  • Sequential Forward Selection (SFS)
  • Sequential Backward Selection (SBS)
  • Sequential Forward Floating Selection (SFFS)
  • Sequential Backward Floating Selection (SBFS)

The next code use the SBS from the mlxten. It has interest features to explore, but is more computationally expensive than previous code, so, take care if you try running it.

# Sequential Backward Selection ls = Lasso(alpha = 0.0005, max_iter = 161, selection = 'cyclic', tol = 0.002, random_state = 101) #lr = LinearRegression() sbs = SFS(estimator = ls, k_features = 52, forward= False, floating = False, scoring = 'neg_mean_squared_error', cv = 3, n_jobs = -1) sbs = sbs.fit(df.values, y_train.values) print('\nSequential Backward Selection:') print(sbs.k_feature_idx_) print('\nCV Score:', sbs.k_score_) display(pd.DataFrame.from_dict(sbs.get_metric_dict()).T) fig1 = plot_sfs(sbs.get_metric_dict(), kind='std_err') plt.title('Sequential Backward Selection (w. StdErr)') plt.grid() plt.show()

SBS is actually computationally expensive, but also generated models with better performance when we go through the hyper parameterization phase.

Feature Selection by Filter Methods

Filter methods use statistical methods for evaluation of a subset of features, they are generally used as a preprocessing step. These methods are also known as univariate feature selection, they examines each feature individually to determine the strength of the relationship of the feature with the dependent variable. These methods are simple to run and understand and are in general particularly good for gaining a better understanding of data, but not necessarily for optimizing the feature set for better generalization.

image

So, the features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable. The correlation is a subjective term here. For basic guidance, you can refer to the following table for defining correlation co-efficients.

Feature/Response Continuous Categorical
Continuous Pearson's Correlation LDA
Categorical Anova Chi-Square

One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you must deal with multicollinearity of features as well before training models for your data.

There are lot of different options for univariate selection. Some examples are:

  • Univariate feature selection
  • Model Based Ranking

Univariate feature selection

On scikit-learn we find variety of implementation oriented to regression tasks to select features according to the k highest scores, see below some of that:

  • f_regression The Pearson's Correlation are covert to F score then to a p-value. So, the selection is based on the F-value between label/feature for regression tasks.
  • mutual_info_regression estimate mutual information for a continuous target variable. The MI between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency. The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances. image

The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.

Other important point is if you use sparse data, for example if we continue consider hot-encode of some categorical data with largest number of distinct values, mutual_info_regression will deal with the data without making it dense.

Let's see the SelectKBest of f_regression and mutual_info_regression for our data:

In [94]:
skb = SelectKBest(score_func=f_regression, k=80)
skb.fit(df, y_train)
select_features_kbest = skb.get_support()
kbest_FR = cols[select_features_kbest]
scores = skb.scores_[select_features_kbest]
feature_scores = pd.DataFrame([(item, score) for item, score in zip(kbest_FR, scores)], columns=['feature', 'score'])
fig = plt.figure(figsize=(40,20))
f1 = fig.add_subplot(121)
feature_scores.sort_values(by='score', ascending=True).plot(y = 'score', x = 'feature', kind='barh', 
                                                            ax = f1, fontsize=10, grid=True) 

skb = SelectKBest(score_func=mutual_info_regression, k=80)
skb.fit(df, y_train)
select_features_kbest = skb.get_support()
kbest_MIR = cols[select_features_kbest]
scores = skb.scores_[select_features_kbest]
feature_scores = pd.DataFrame([(item, score) for item, score in zip(kbest_FR, scores)], columns=['feature', 'score'])
f2 = fig.add_subplot(122)
feature_scores.sort_values(by='score', ascending=True).plot(y = 'score', x = 'feature', kind='barh', 
                                                            ax = f2, fontsize=10, grid=True) 
plt.show()

Select Features by Embedded Methods

In addition to the return of the performance itself, some models has in their internal process some step to features select that best fit their proposal, and returns the features importance too. Thus, they provide two straightforward methods for feature selection and combine the qualities' of filter and wrapper methods. image Some of the most popular examples of these methods are LASSO, RIDGE, SVM, Regularized trees, Memetic algorithm, and Random multinomial logit.

In the case of Random Forest, some other models base on trees, we have two basic approaches implemented in the packages:

  1. Gini/Entropy Importance or Mean Decrease in Impurity (MDI)
  2. Permutation Importance or Mean Decrease in Accuracy
  3. Permutation with Shadow Features
  4. Gradient Boosting

Others models has concerns om multicollinearity problem and adding additional constraints or penalty to regularize. When there are multiple correlated features, as is the case with very many real life datasets, the model becomes unstable, meaning that small changes in the data can cause large changes in the model, making model interpretation very difficult on the regularization terms.

This applies to regression models like LASSO and RIDGE. In classifier cases, you can use 'SGDClassifier' where you can set the loss parameter to 'log' for Logistic Regression or 'hinge' for 'SVM'. In 'SGDClassifier' you can set the penalty to either of 'l1', 'l2' or 'elasticnet' which is a combination of both.

Let's start with more details and examples:

Feature Selection by Gradient Boosting

The LightGBM model the importance is calculated from, if 'split', result contains numbers of times the feature is used in a model, if 'gain', result contains total gains of splits which use the feature.

On the XGBoost model the importance is calculated by:

  • 'weight': the number of times a feature is used to split the data across all trees.
  • 'gain': the average gain across all splits the feature is used in.
  • 'cover': the average coverage across all splits the feature is used in.
  • 'total_gain': the total gain across all splits the feature is used in.
  • 'total_cover': the total coverage across all splits the feature is used in.

First measure is split-based and is very similar with the one given by for Gini Importance. But it doesn't take the number of samples into account.

The second measure is gain-based. It's basically the same as the Gini Importance implemented in R packages and in scikit-learn with Gini impurity replaced by the objective used by the gradient boosting model.

The cover, implemented exclusively in XGBoost, is counting the number of samples affected by the splits based on a feature.

get_score(fmap='', importance_type='weight') Get feature importance of each feature. Importance type can be defined as:

The default measure of both XGBoost and LightGBM is the split-based one. I think this measure will be problematic if there are one or two feature with strong signals and a few features with weak signals. The model will exploit the strong features in the first few trees and use the rest of the features to improve on the residuals. The strong features will look not as important as they actually are. While setting lower learning rate and early stopping should alleviate the problem, also checking gain-based measure may be a good idea.

Note that these measures are purely calculated using training data, so there's a chance that a split creates no improvement on the objective in the holdout set. This problem is more severe than in the random forest since gradient boosting models are more prone to over-fitting.

Feature importance scores can be used for feature selection in scikit-learn.

This is done using the SelectFromModel class that takes a model and can transform a dataset into a subset with selected features.

This class can take a previous trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is used when you call the transform() method on the SelectFromModel instance to consistently select the same features on the training dataset and the test dataset.

In [95]:
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

# split data into train and test sets
X_train, X_test, y, y_test = train_test_split(df, y_train, test_size=0.30, random_state=101)

# fit model on all training data
#importance_type='gain'
model =  XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0, max_delta_step=0, 
                      random_state=101, min_child_weight=1, missing=None, n_jobs=4,  
                      scale_pos_weight=1, seed=None, silent=True, subsample=1)


model.fit(X_train, y)
fig=plt.figure(figsize=(20,20))
ax = fig.add_subplot(121)
g = plot_importance(model, height=0.5, ax=ax)

# Using each unique importance as a threshold
thresholds = np.sort(np.unique(model.feature_importances_)) 
best = 1e36
colsbest = 31
my_model = model
threshold = 0

for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model =  XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0, max_delta_step=0, 
                                    random_state=101, min_child_weight=1, missing=None, n_jobs=4, 
                                    scale_pos_weight=1, seed=None, silent=True, subsample=1)
    selection_model.fit(select_X_train, y)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    r2 = r2_score(y_test, predictions)
    mse = mean_squared_error(y_test, predictions)
    print("Thresh={:1.3f}, n={:d}, R2: {:2.2%} with MSE: {:.4f}".format(thresh, select_X_train.shape[1], r2, mse))
    if (best >= mse):
        best = mse
        colsbest = select_X_train.shape[1]
        my_model = selection_model
        threshold = thresh
        
ax = fig.add_subplot(122)
g = plot_importance(my_model,height=0.5, ax=ax, 
                    title='The best MSE: {:1.4f} with {:d} features'.\
                    format(best, colsbest))

feature_importances = [(score, feature) for score, feature in zip(model.feature_importances_, cols)]
XGBest = pd.DataFrame(sorted(sorted(feature_importances, reverse=True)[:colsbest]), columns=['Score', 'Feature'])
g = XGBest.plot(x='Feature', kind='barh', figsize=(20,10), fontsize=14, grid= True,
     title='Original feature importance from selected features')
plt.tight_layout(); plt.show()
XGBestCols = XGBest.iloc[:, 1].tolist()
Thresh=0.000, n=188, R2: 50.18% with MSE: 0.0832
Thresh=0.002, n=77, R2: 50.18% with MSE: 0.0832
Thresh=0.004, n=58, R2: 49.91% with MSE: 0.0836
Thresh=0.005, n=49, R2: 51.02% with MSE: 0.0818
Thresh=0.007, n=40, R2: 50.17% with MSE: 0.0832
Thresh=0.009, n=31, R2: 49.64% with MSE: 0.0841
Thresh=0.011, n=27, R2: 49.38% with MSE: 0.0845
Thresh=0.012, n=24, R2: 48.61% with MSE: 0.0858
Thresh=0.014, n=19, R2: 49.39% with MSE: 0.0845
Thresh=0.018, n=16, R2: 49.96% with MSE: 0.0836
Thresh=0.021, n=14, R2: 49.69% with MSE: 0.0840
Thresh=0.023, n=10, R2: 46.45% with MSE: 0.0894
Thresh=0.028, n=9, R2: 46.85% with MSE: 0.0887
Thresh=0.034, n=8, R2: 46.21% with MSE: 0.0898
Thresh=0.035, n=7, R2: 46.87% with MSE: 0.0887
Thresh=0.044, n=6, R2: 46.18% with MSE: 0.0899
Thresh=0.057, n=5, R2: 46.88% with MSE: 0.0887
Thresh=0.072, n=3, R2: 42.78% with MSE: 0.0955
Thresh=0.110, n=1, R2: 33.79% with MSE: 0.1106
In [96]:
bcols = set(pv_cols).union(set(RFEcv)).union(set(kbest_FR)).union(set(kbest_MIR)).union(set(XGBestCols)).union(set(SBS))
print('Features Selected by Filter Methods:\n')
print("Extra features select by Kbest_FR:", set(kbest_FR).\
     difference(set(pv_cols).union(set(RFEcv)).union(set(kbest_MIR)).union(set(XGBestCols)).union(set(SBS))), '\n')
print("Extra features select by Kbest_MIR:", set(kbest_MIR).\
      difference(set(pv_cols).union(set(RFEcv)).union(set(kbest_FR)).union(set(XGBestCols)).union(set(SBS))), '\n')
print('_'*75,'\nFeatures Selected by Wrappers Methods:\n')
print("Extra features select by pv_cols:", set(pv_cols).\
      difference(set(SBS).union(set(RFEcv)).union(set(kbest_MIR)).union(set(kbest_FR)).union(set(XGBestCols))),'\n')
print("Extra features select by RFEcv:", set(RFEcv).\
      difference(set(pv_cols).union(set(kbest_FR)).union(set(kbest_MIR)).union(set(XGBestCols)).union(set(SBS))), '\n')
print("Extra features select by SBS:", set(SBS).\
      difference(set(pv_cols).union(set(RFEcv)).union(set(kbest_MIR)).union(set(kbest_FR)).union(set(XGBestCols))), '\n')
print('_'*75,'\nFeatures Selected by Embedded Methods:\n')
print("Extra features select by XGBestCols:", set(XGBestCols).\
      difference(set(pv_cols).union(set(RFEcv)).union(set(kbest_MIR)).union(set(kbest_FR)).union(set(SBS))), '\n')
print('_'*75,'\nIntersection Features Selected:')
intersection = set(SBS).intersection(set(kbest_MIR)).intersection(set(RFEcv)).intersection(set(pv_cols)).\
               intersection(set(kbest_FR)).intersection(set(XGBestCols))
print(intersection, '\n')
print('_'*75,'\nUnion All Features Selected:')
print('Total number of features selected:', len(bcols))
print('\n{0:2d} features removed if use the union of selections: {1:}'.\
      format(len(cols.difference(bcols)), cols.difference(bcols)))
Features Selected by Filter Methods:

Extra features select by Kbest_FR: {'Heating_Grav', 'Alley_NA', 'Heating_GasA', 'Condition1_Feedr'} 

Extra features select by Kbest_MIR: {'Exterior2nd_CmentBd', 'ExterCond', 'Exterior1st_CemntBd'} 

___________________________________________________________________________ 
Features Selected by Wrappers Methods:

Extra features select by pv_cols: {'SaleCondition_Alloca', 'Neighborhood_NPkVill', 'Neighborhood_Blueste', 'Exterior2nd_BrkFace', 'HouseStyle_15Unf', 'MSSubClass_75', 'MSSubClass_190', 'MSSubClass_85', 'MSSubClass_180', 'Condition2_Artery', 'MoSold_12', 'Neighborhood_Veenker', 'Exterior1st_WdShing', 'MiscFeature_Pool', 'Exterior1st_Stucco', 'Neighborhood_SawyerW', 'RoofStyle_Gambrel', 'MSSubClass_40', 'Neighborhood_SWISU', 'MoSold_6', 'MiscFeature_Othr', 'RoofMatl_WdShake', 'Exterior2nd_Stone', 'Condition1_RRAn', 'MoSold_1', 'RoofStyle_Mansard', 'RoofStyle_Shed', 'Exterior1st_AsphShn', 'MoSold_4', 'RoofMatl_TarGrv', 'Neighborhood_Blmngtn', 'SaleCondition_AdjLand', 'Street'} 

Extra features select by RFEcv: {'Alley_Pave', 'Exterior1st_Plywood', 'MiscFeature_Shed'} 

Extra features select by SBS: set() 

___________________________________________________________________________ 
Features Selected by Embedded Methods:

Extra features select by XGBestCols: {'Exterior1st_BrkComm', 'Condition1_PosA'} 

___________________________________________________________________________ 
Intersection Features Selected:
{'KitchenQual', 'GarageFinish', 'TotBathrooms', 'TotRmsAbvGrd', 'Neighborhood_BrkSide', 'ConstructArea', 'Neighborhood_IDOTRR'} 

___________________________________________________________________________ 
Union All Features Selected:
Total number of features selected: 157

31 features removed if use the union of selections: Index(['BldgType_2fmCon', 'BldgType_Twnhs', 'Condition1_RRNe',
       'Condition1_RRNn', 'Condition2_Feedr', 'Condition2_Norm',
       'Condition2_PosA', 'Condition2_PosN', 'Electrical_FuseP',
       'Exterior2nd_AsphShn', 'Exterior2nd_BrkCmn', 'Exterior2nd_ImStucc',
       'Exterior2nd_Stucco', 'Exterior2nd_WdShng', 'Foundation_Stone',
       'Foundation_Wood', 'FullBath', 'GarageType_Basment', 'Heating_Wall',
       'HouseStyle_25Unf', 'LandContour_HLS', 'LandContour_Low',
       'LotConfig_FR3', 'MSZoning_RH', 'MiscVal', 'SaleType_CWD',
       'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI', 'SaleType_ConLw',
       'SaleType_Oth'],
      dtype='object')

Separate data for modeling

image

In [97]:
totalCols = list(bcols.union(set(colsP)))
train = all_data.loc[all_data.SalePrice>0 , list(totalCols)].reset_index(drop=True, inplace=False)
y_train = all_data.SalePrice[all_data.SalePrice>0].reset_index(drop=True, inplace=False)
test = all_data.loc[all_data.SalePrice==0 , list(totalCols)].reset_index(drop=True, inplace=False)

Feature Selection into the Pipeline

Since we have a very different selection of features selection methods, from the results it may be interesting keeping only the removal of collinear and multicollinear, and can decide with we must have the pre polynomials and apply PCA or not. We can still improve the results through hyper parameterization and cross-validation.

In [98]:
class select_fetaures(object): # BaseEstimator, TransformerMixin, 
    def __init__(self, select_cols):
        self.select_cols_ = select_cols
    
    def fit(self, X, Y ):
        print('Recive {0:2d} features...'.format(X.shape[1]))
        return self

    def transform(self, X):
        print('Select {0:2d} features'.format(X.loc[:, self.select_cols_].shape[1]))
        return X.loc[:, self.select_cols_]    

    def fit_transform(self, X, Y):
        self.fit(X, Y)
        df = self.transform(X)
        return df 
        #X.loc[:, self.select_cols_]    

    def __getitem__(self, x):
        return self.X[x], self.Y[x]
        

Compressing Data via Dimensionality Reduction

image

PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. If there are n observations with p variables, then the number of distinct principal components is min(n-1,p). This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables. image.png

Let's see how PCA can reduce the dimensionality of our dataset with minimum of lose information:

In [99]:
scale = RobustScaler() 
df = scale.fit_transform(train)

pca = PCA().fit(df) # whiten=True
print('With only 120 features: {:6.4%}'.format(sum(pca.explained_variance_ratio_[:120])),"%\n")

print('After PCA, {:3} features only not explained {:6.4%} of variance ratio from the original {:3}'.format(120,
                                                                                    (sum(pca.explained_variance_ratio_[120:])),
                                                                                    df.shape[1]))
#del df, data, all_data
With only 120 features: 99.8846% %

After PCA, 120 features only not explained 0.1154% of variance ratio from the original 187

Modeling

image First, we start to looking at different approaches to implement linear regression models, and use hyper parametrization, cross validation and compare the results between different erros measures.

Model Hyper Parametrization

Evaluate Results

Mean Squared Error (MSE)

In statistics, MSE or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. image.png Which is simply the average value of the SSE cost function that we minimize to fit the linear regression model. The MSE is useful to for comparing different regression models or for tuning their parameters via a grid search and cross-validation.

Root-Mean-Square Error (RMSE)

The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. image.png

Mean Absolute Error (MAE)

In statistics, mean absolute error (MAE) is a measure of difference between two continuous variables, is also the average horizontal distance between each point and the identity line. image.png

Coefficient of determination ( R2 ) image.png Wheres: SSres is the sum of squares of residuals, also called the residual sum of squares: image and SStot is the total sum of squares (proportional to the variance of the data): image Which can be understood as a standardized version of the MSE, for better interpretability of the model performance (try to say that tree times and faster!). In other words, R2 is the fraction of response variance that is captured by the model.

For the training dataset, R2 is bounded between 0 and 1, but it can become negative for the test set. If R2 =1 , the model fits the data perfectly with a corresponding MSE = 0 .

In [100]:
def get_results(model, name='NAN', log=False):
    
    rcols = ['Name','Model', 'BestParameters', 'Scorer', 'Index', 'BestScore', 'BestScoreStd', 'MeanScore', 
             'MeanScoreStd', 'Best']
    res = pd.DataFrame(columns=rcols)
    
    results = gs.cv_results_
    modelo = gs.best_estimator_

    scoring = {'MEA': 'neg_mean_absolute_error', 'R2': 'r2', 'RMSE': 'neg_mean_squared_error'}

    for scorer in sorted(scoring):
        best_index = np.nonzero(results['rank_test_%s' % scoring[scorer]] == 1)[0][0]
        if scorer == 'RMSE': 
            best = np.sqrt(-results['mean_test_%s' % scoring[scorer]][best_index])
            best_std = np.sqrt(results['std_test_%s' % scoring[scorer]][best_index])
            scormean = np.sqrt(-results['mean_test_%s' % scoring[scorer]].mean())
            stdmean = np.sqrt(results['std_test_%s' % scoring[scorer]].mean())
            if log:
                best = np.expm1(best)
                best_std = np.expm1(best_std)
                scormean = np.expm1(scormean)
                stdmean = np.expm1(stdmean)
        elif scorer == 'MEA':
            best = (-results['mean_test_%s' % scoring[scorer]][best_index])
            best_std = results['std_test_%s' % scoring[scorer]][best_index]
            scormean =(-results['mean_test_%s' % scoring[scorer]].mean())
            stdmean = results['std_test_%s' % scoring[scorer]].mean()
            if log:
                best = np.expm1(best)
                best_std = np.expm1(best_std)
                scormean = np.expm1(scormean)
                stdmean = np.expm1(stdmean)
        else:
            best = results['mean_test_%s' % scoring[scorer]][best_index]*100
            best_std = results['std_test_%s' % scoring[scorer]][best_index]*100
            scormean = results['mean_test_%s' % scoring[scorer]].mean()*100
            stdmean = results['std_test_%s' % scoring[scorer]].mean()*100
        
        r1 = pd.DataFrame([(name, modelo, gs.best_params_, scorer, best_index, best, best_std, scormean, 
                            stdmean, gs.best_score_)],
                          columns = rcols)
        res = res.append(r1)
        
    if log:
        bestscore = np.expm1(np.sqrt(-gs.best_score_))
    else:
        bestscore = np.sqrt(-gs.best_score_)
        
    print("Best Score: {:.6f}".format(bestscore))
    print('---------------------------------------')
    print('Best Parameters:')
    print(gs.best_params_)
    
    return res

Residuals Plots

The plot of differences or vertical distances between the actual and predicted values. Commonly used graphical analysis for diagnosing regression models to detect nonlinearity and outliers, and to check if the errors are randomly distributed. image.png Some points for help you in your analysis:

  • Since Residual = Observed – Predicted positive values for the residual (on the y-axis) mean the prediction was too low, and negative values mean the prediction was too high; 0 means the guess was exactly correct.

  • They're pretty symmetrically distributed, tending to cluster towards the middle of the plot.

    For a good regression model, we would expect that the errors are randomly distributed and the residuals should be randomly scattered around the centerline.

  • Detect outliers, which are represented by the points with a large deviation from the centerline.

    Now, you might be wondering how large a residual has to be before a data point should be flagged as being an outlier. The answer is not straightforward, since the magnitude of the residuals depends on the units of the response variable. That is, if your measurements are made in pounds, then the units of the residuals are in pounds. And, if your measurements are made in inches, then the units of the residuals are in inches. Therefore, there is no one "rule of thumb" that we can define to flag a residual as being exceptionally unusual.

    There's a solution to this problem. We can make the residuals unitlessby dividing them by their standard deviation. In this way we create what are called standardized residuals. They tell us how many standard deviations above — if positive — or below — if negative — a data point is from the estimated regression line.

  • They're clustered around the lower single digits of the y-axis (e.g., 0.5 or 1.5, not 30 or 150).

    Again, doesn't exist a unique rule for all cases. But, recall that the empirical rule tells us that, for data that are normally distributed, 95% of the measurements fall within 2 standard deviations of the mean. Therefore, any observations with a standardized residual greater than 2 or smaller than -2 might be flagged for further investigation. It is important to note that by using this "greater than 2, smaller than -2 rule," approximately 5% of the measurements in a data set will be flagged even though they are perfectly fine. It is in your best interest not to treat this rule of thumb as a cut-and-dried, believe-it-to-the-bone, hard-and fast rule! So, in most cases it may be more practical to investigate further any observations with a standardized residual greater than 3 or smaller than -3. Using the empirical rule we would expect only 0.2% of observations to fall into this category.

  • If we see patterns in a residual plot, it means that our model is unable to capture some explanatory information.

    A special case is any systematic (non-random) pattern. It is sufficient to suggest that the regression function is not linear. For example, if the residuals depart from 0 in some systematic manner, such as being positive for small x values, negative for medium x values, and positive again for large x values.

  • Non-constant error variance shows up on a residuals vs. fits (or predictor) plot in any of the following ways:

    • The plot has a "fanning" effect. That is, the residuals are close to 0 for small x values and are more spread out for large x values.
    • The plot has a "funneling" effect. That is, the residuals are spread out for small x values and close to 0 for large x values.
    • Or, the spread of the residuals in the residuals vs. fits plot varies in some complex fashion.
In [101]:
def resilduals_plots(lr, X, Y, log=False):
    y_pred = lr.predict(X)
    residual = pd.DataFrame()
    residual['Predict'] = y_pred
    residual['Residual'] = Y - y_pred
    residual['Predicted'] = np.expm1(residual.Predict)
    residual['StdResidual'] = np.expm1(Y) - residual.Predicted
    residual.StdResidual = residual.StdResidual / residual.StdResidual.std()
    residual['IDX'] = X.index
    
    if log:
        fig = plt.figure(figsize=(20,10))
        ax = fig.add_subplot(121)
        g = sns.regplot(y='Residual', x='Predict', data = residual, order=1, ax = ax) 
        plt.xlabel('Log Predicted Values')
        plt.ylabel('Log Residuals')
        plt.hlines(y=0, xmin=min(Y)-1, xmax=max(Y)+1, lw=2, color='red')
        plt.xlim([min(Y)-1, max(Y)+1])

        ax = fig.add_subplot(122)
        g = sns.regplot(y='StdResidual', x='Predicted', data = residual, order=1, ax = ax) 
        plt.xlabel('Predicted Values')
        plt.ylabel('Standardized Residuals')
        plt.hlines(y=0, xmin=np.expm1(min(Y))-1, xmax=np.expm1(max(Y))+1, lw=2, color='red')
        plt.xlim([np.expm1(min(Y))-1, np.expm1(max(Y))+1])
    else:
        residual.StdResidual = residual.Residual / residual.Residual.std()
        residual.drop(['Residual', 'Predicted'], axis = 1, inplace=True)
        g = sns.regplot(y='StdResidual', x='Predict', data = residual, order=1) 
        plt.xlabel('Predicted Values')
        plt.ylabel('Standardized Residuals')
        plt.hlines(y=0, xmin=min(Y)-1, xmax=max(Y)+1, lw=2, color='red')
        plt.xlim([min(Y)-1, max(Y)+1])

    plt.show()  

    return residual

Model Hiperparametrization

Lasso (Least Absolute Shrinkage and Selection Operator)

Lasso was introduced in order to improve the prediction accuracy and interpretability of regression models by include a with L1 prior as regularizer and altering the model fitting process to select only a subset of the provided covariates for use in the final model rather than using all of them. Lasso) was originally formulated for least squares models and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connections between Lasso coefficient estimates and so-called soft thresholding. It also reveals that the coefficient estimates need not be unique if covariates are collinear.

Prior to lasso, the most widely used method for choosing which covariates to include was stepwise selection, which only improves prediction accuracy in certain cases, such as when only a few covariates have a strong relationship with the outcome. However, in other cases, it can make prediction error worse. Also, at the time, ridge regression was the most popular technique for improving prediction accuracy. Ridge regression improves prediction error by shrinking large regression coefficients in order to reduce overfitting, but it does not perform covariate selection and therefore does not help to make the model more interpretable.

Lasso is able to achieve both of these goals by forcing the sum of the absolute value of the regression coefficients to be less than a fixed value, which depending on the regularization strength, certain weights can become zero, which makes the Lasso also useful as a supervised feature selection technique, by effectively choosing a simpler model that does not include those coefficients. However, a limitation of the Lasso is that it selects at most n variables if m > n.

This idea is similar to ridge regression, in which the sum of the squares of the coefficients is forced to be less than a fixed value, though in the case of ridge regression, this only shrinks the size of the coefficients, it does not set any of them to zero.

The optimization objective for Lasso is: (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0, no L2 penalty.

This characteristics turn Lasso a interesting alternative approach that can lead to sparse models.

From sklearn its most important parameters are:

  • alpha: Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
  • max_iter: The maximum number of iterations
  • selection: If set to 'random', a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to 'random') often leads to significantly faster convergence especially when tol is higher than 1e-4.
  • tol: The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

So, I run many times my model, with different parameters, selection features, reduction or not and with and without log1P transformation of Sales Price. Below I preserve the code with the best options and with few possibilities for you can see the grid search cv in action, but I encourage you to make changes and see for yourself.

In [102]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', Lasso(random_state = 101))]) 

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-5, len(SEL)-3, len(SEL)] 
whiten = [False, True]
max_iter = [5] #, 10, 100, 200, 300, 400, 500, 600]  
alpha = [0.0003, 0.0007, 0.0005, 0.05, 0.5, 1.0]
selection = ['random', 'cyclic'] 
tol = [2e-03, 0.003, 0.001, 0.0005]
param_grid =\
            dict(
                  model__alpha = alpha
                  ,model__max_iter = max_iter
                  ,model__selection = selection
                  ,model__tol = tol
                  ,pca__n_components = n_components
                  ,pca__whiten = whiten 
                ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error' #, iid=False
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=4)

lasso = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)), 
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

lasso.fit(train,y_train)

results = get_results(lasso, 'lasso Lg1', log=True)
display(results.loc[:, 'Scorer' : 'MeanScoreStd'])
r = resilduals_plots(lasso, train, y_train, log=True)
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[Parallel(n_jobs=4)]: Done  49 tasks      | elapsed:    4.9s
[Parallel(n_jobs=4)]: Done 349 tasks      | elapsed:   11.1s
[Parallel(n_jobs=4)]: Done 849 tasks      | elapsed:   20.8s
[Parallel(n_jobs=4)]: Done 1433 out of 1440 | elapsed:   32.6s remaining:    0.1s
[Parallel(n_jobs=4)]: Done 1440 out of 1440 | elapsed:   32.7s finished
Best Score: 0.116418
---------------------------------------
Best Parameters:
{'model__alpha': 0.0007, 'model__max_iter': 5, 'model__selection': 'random', 'model__tol': 0.002, 'pca__n_components': 106, 'pca__whiten': True}
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 51 0.080 0.003 0.153 0.006
0 R2 51 92.364 0.223 69.284 0.743
0 RMSE 51 0.116 0.026 0.247 0.065
Select 109 features

As you can see our Lasso has good performance with RFEcv selection features plus polynomials features, with: MAE 0.080, RMSE 0.1164 and R2 of 92.36%.

From the residuals plot with log sales price: We saw that most are plot randomly scattered around the centerline. You can think that some outliers that we didn't remove it can be detect! Are they the points with a large deviation from the centerline, the most extern points? However, notice that our deviation s no grater the 0.42 and no below to -0.8, they're clustered around the lower single digits of the y-axis.

In general, there aren't any clear patterns, but with more attention we can observe some patterns in a few points, it means that our model is unable to capture some explanatory information, but as you can see, it is not easy to solve then. Maybe other model, or stack models or some feature that we drooped can help?

In addition, the log transformation is also used to handle cases where the expected distribution of the dependent variable leads to a funnel-like residue plot. Note that our residue plot without the log actually has a slight funnel look, but note that the model was trained and validated with the log transformation of the sales prices. So the transformation actually dealt with the problem of the distribution of the variable, but seems to have had little effect on the deviations of the residual

So, you could decide to cut some of these outliers, or simply go ahead and believe that your model is performing well and that the transformation in log1P of the selling price was successful. But calm, as you can see from the chart of errors on the right, things are not quite like that.

In fact I have seen many books and some colleagues do just that, turn the dependent variable into something like log, check for QQ testing that there has been improvement by reversing it to normal distribution, plotting errors and cutting or moving on believing it to be correct, but when you has applied a transformation on your response variable it is recommended that you reverse it when evaluating the errors, and this is what we did in the right chart.

So you can see that the residues have a pattern, curiously linear. and is more clear to see some outliers that really important to drop. But for now, let's ignore that we already know this, in order to show that if we cut some of the biggest deviations from the log observations perspective:

In [103]:
fica =  list(r.IDX[abs(r.Residual)<=0.3])
print('Outliers removed:', r.shape[0]-len(fica))
t = train.iloc[fica, :].reset_index(drop=True, inplace=False)
y_t = y_train.iloc[fica].reset_index(drop=True, inplace=False)

lasso.fit(t, y_t)
results = get_results(lasso, 'lasso Lg2', log=True)
display(results.loc[:, 'Scorer' : 'MeanScoreStd'])
r = resilduals_plots(lasso, t, y_t, log=True)
del  t, y_t, fica
Outliers removed: 18
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[Parallel(n_jobs=4)]: Done  46 tasks      | elapsed:    5.0s
[Parallel(n_jobs=4)]: Done 346 tasks      | elapsed:   10.1s
[Parallel(n_jobs=4)]: Done 846 tasks      | elapsed:   18.1s
Best Score: 0.094187
---------------------------------------
Best Parameters:
{'model__alpha': 0.0007, 'model__max_iter': 5, 'model__selection': 'cyclic', 'model__tol': 0.002, 'pca__n_components': 106, 'pca__whiten': True}
[Parallel(n_jobs=4)]: Done 1440 out of 1440 | elapsed:   28.2s finished
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 75 0.072 0.004 0.146 0.006
0 R2 51 94.577 0.440 70.894 0.611
0 RMSE 75 0.094 0.023 0.232 0.060
Select 109 features

We would have an improvement as you can see from the MAE with 0.072, RMSE is 0.094 and R2 is 94.58%. However, our residual plot without log shows that this actually didn't had a good effect, where we continue have outliers, the linear pattern has a little increase and and the funnel shape intensified.

So, let's see how it performs a model without transforming the sales price:

In [104]:
y = np.expm1(y_train)
lasso.fit(train, y)
results = get_results(lasso, 'lasso N1')
display(results.loc[:, 'Scorer' : 'MeanScoreStd'])
r = resilduals_plots(lasso, train, y)
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[Parallel(n_jobs=4)]: Done  49 tasks      | elapsed:    4.8s
[Parallel(n_jobs=4)]: Done 647 tasks      | elapsed:   14.3s
Best Score: 21504.641415
---------------------------------------
Best Parameters:
{'model__alpha': 1.0, 'model__max_iter': 5, 'model__selection': 'random', 'model__tol': 0.002, 'pca__n_components': 106, 'pca__whiten': False}
[Parallel(n_jobs=4)]: Done 1440 out of 1440 | elapsed:   26.1s finished
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 242 14582.133 854.804 14647.674 858.145
0 R2 242 92.600 0.579 92.537 0.526
0 RMSE 242 21504.641 7071.982 21600.052 7013.087
Select 109 features

You then see these results and then you are disappointed, so much discussion to make R2 better only from 92.364 to 92.60% and didn't leave to have the funnel shape. But it confirm that we don't need transform our sales price. In the other hand, note that the residual graph does not have the slightly linear pattern and the funnel shape is more strangled. And more, now we can also apply the outliers detection rule.

In [105]:
fica =  list(r.IDX[abs(r.StdResidual)<3]) # =2.7
print('Outliers removed:', r.shape[0]-len(fica))
t = train.iloc[fica, :].reset_index(drop=True, inplace=False)
y_l = y_train.iloc[fica].reset_index(drop=True, inplace=False)
y_n = np.expm1(y_l)

lasso.fit(t, y_n)
results = get_results(lasso, 'lasso N2')
display(results.loc[:, 'Scorer' : 'MeanScoreStd'])
r2 = resilduals_plots(lasso, t, y_n)
del fica, r2
Outliers removed: 23
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    5.0s
[Parallel(n_jobs=4)]: Done 336 tasks      | elapsed:   10.0s
[Parallel(n_jobs=4)]: Done 836 tasks      | elapsed:   18.0s
Best Score: 17597.954460
---------------------------------------
Best Parameters:
{'model__alpha': 1.0, 'model__max_iter': 5, 'model__selection': 'random', 'model__tol': 0.002, 'pca__n_components': 109, 'pca__whiten': False}
[Parallel(n_jobs=4)]: Done 1440 out of 1440 | elapsed:   28.2s finished
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 244 12966.314 1004.908 13033.307 980.243
0 R2 242 94.283 0.795 94.212 0.792
0 RMSE 244 17597.954 6826.881 17714.918 6968.298
Select 109 features

Note that although our R2 is not higher than what we get in the cut over the log observations, now you can see that the deletion of only 23 outliers made more sense, being more effective in improving the model and did not create any slightly linear pattern in residues, but it seems to widen a bit to the right of our funnel.

So you see, this confirms that this does not mean that anyone using log1p has failed, but shows that without the help of the residuals plot and the use of standardized metrics, it would be very difficult to identify these 23 outliers, more still decide to cut them, as well as require more tests to confirm the model.

So what we have just seen is another procedure for identifying and cutting outliers with the intention of improving the performance and generalization of the model. However, this procedure has to be taken care of; in addition, we might not have done it now, but the one left for a final stage where we would have already selected the model or built our stacked model. Thus, we do not run the risk of excluding observations that are outliers in this model, but which may be treated in another. Remember that the main outliers, the most damaging, had already been identified and eliminated in the EDA phase.

So at this point we will not cut any additional outlier, but we will not make use of the sales price transformation in your log1p, and thus avoid the linear pattern of the residuals.

In [106]:
y_log = y_train.copy()
y_train = np.expm1(y_train)

lasso.fit(train, y_train)

results = get_results(lasso, 'lasso', log=False)
display(results.loc[:, 'Scorer' : 'MeanScoreStd'])
r = resilduals_plots(lasso, train, y_train, log=False)
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[Parallel(n_jobs=4)]: Done  49 tasks      | elapsed:    5.2s
[Parallel(n_jobs=4)]: Done 647 tasks      | elapsed:   18.2s
Best Score: 21504.641415
---------------------------------------
Best Parameters:
{'model__alpha': 1.0, 'model__max_iter': 5, 'model__selection': 'random', 'model__tol': 0.002, 'pca__n_components': 106, 'pca__whiten': False}
[Parallel(n_jobs=4)]: Done 1440 out of 1440 | elapsed:   33.5s finished
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 242 14582.133 854.804 14647.674 858.145
0 R2 242 92.600 0.579 92.537 0.526
0 RMSE 242 21504.641 7071.982 21600.052 7013.087
Select 109 features

XGBRegressor

XGBoost is an open-source software library which provides a gradient boosting framework for C++, Java, Python, R, and Julia. It works on Linux, Windows, and macOS. From the project description, it aims to provide a "Scalable, Portable and Distributed Gradient Boosting (GBM, GBRT, GBDT) Library". Other than running on a single machine, it also supports the distributed processing frameworks Apache Hadoop, Apache Spark, and Apache Flink. It has gained much popularity and attention recently as it was the algorithm of choice for many winning teams of a number of machine learning competitions.

The XGBRegressor is a scikit-learn wrapper interface for running a regressor on XGBoost. Its most important parameters are:

  • max_depth: Maximum tree depth for base learners.
  • learning_rate: Boosting learning rate (the xgb's "eta")
  • n_estimators: Number of boosted trees to fit.
  • silent: Whether to print messages while running boosting.
  • objective: Specify the learning task and the corresponding learning objective or a custom objective function to be used.
  • booster: Specify which booster to use: 'gbtree', 'gblinear' or 'dart'.
  • n_jobs: Number of parallel threads used to run xgboost.
  • gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
  • min_child_weight: Minimum sum of instance weight needed in a child.
  • max_delta_step: Maximum delta step we allow each tree's weight estimation to be.
  • subsample: Subsample ratio of the training instance.
  • colsample_bytree: Subsample ratio of columns when constructing each tree.
  • colsample_bylevel: Subsample ratio of columns for each split, in each level.
  • reg_alpha: L1 regularization term on weights.
  • reg_lambda: L2 regularization term on weights (xgb's lambda).
  • scale_pos_weight: Balancing of positive and negative weights.
  • base_score: The initial prediction score of all instances, global bias.
In [107]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', XGBRegressor(random_state=101,  n_jobs=2, silent=False))])

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-18, len(SEL)-19, len(SEL)-20] 
whiten = [True] #, False]
n_est = [3500] # [500, 750, 1000, 2000, 2006] # np.arange(1997, 2009, 3) # 
max_depth = [3] #, 4]
learning_rate = [0.01] #, 0.03] #, 0.1, 0.05
reg_lambda = [1] #0.1, 1e-06, 1e-04, 1e-03, 1e-02, 1e-05, 1, 0.0] 
reg_alpha= [1] # , 0.5, 1, 0.0]
booster = ['gblinear'] #'dart', 'gbtree']  
objective = ['reg:tweedie'] #, 'reg:linear', 'reg:gamma']

param_grid =\
            dict(
                  pca__n_components = n_components,
                  pca__whiten = whiten, 
                  model__n_estimators= n_est
                  ,model__booster = booster
                  ,model__objective = objective
                  ,model__learning_rate = learning_rate
                  ,model__reg_lambda = reg_lambda
                  ,model__reg_alpha = reg_alpha
                  ,model__max_depth = max_depth
                ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=4)
 
XGBR = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

XGBR.fit(train, y_train)

res = get_results(XGBR, 'XGBRegressor', log=False)
resilduals_plots(XGBR, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:   19.9s finished
Best Score: 19959.960186
---------------------------------------
Best Parameters:
{'model__booster': 'gblinear', 'model__learning_rate': 0.01, 'model__max_depth': 3, 'model__n_estimators': 3000, 'model__objective': 'reg:tweedie', 'model__reg_alpha': 1, 'model__reg_lambda': 1, 'pca__n_components': 90, 'pca__whiten': True}
Select 109 features
Out[107]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 1 13333.080 762.545 13345.617 742.448
0 R2 1 93.615 0.541 93.606 0.531
0 RMSE 1 19959.960 6373.905 19976.697 6404.169

From the results, we can saw that XGB was better than Lasso, but it is much more computationally expensive, especially for hyper parameterization.

From the residuals plots we can observe some linear pattern. Although this model has generated this unwanted residue pattern it seems to have been able to capture some nonlinear patterns. We will have to check if a stake model will be able to produce better results than the two models individually.

Gradient Boosting Regressor

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

On the sklearn implementation, in each stage a regression tree is fit on the negative gradient of the given loss function. Its most important parameters are:

  • loss: loss function to be optimized. 'ls' refers to least squares regression. 'lad' (least absolute deviation) is a highly robust loss function solely based on order information of the input variables. 'huber' is a combination of the two. 'quantile' allows quantile regression (use alpha to specify the quantile).
  • learning_rate: learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.
  • n_estimators: The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
  • criterion: The function to measure the quality of a split. Supported criteria are 'friedman_mse' for the mean squared error with improvement score by Friedman, 'mse' for mean squared error, and 'mae' for the mean absolute error. The default value of 'friedman_mse' is generally the best as it can provide a better approximation in some cases.
  • min_samples_split: The minimum number of samples required to split an internal node:
  • min_samples_leaf: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
  • max_depth: maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
  • alpha: The alpha-quantile of the huber loss function and the quantile loss function. Only if loss='huber' or loss='quantile'.
  • tol: Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.
In [108]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', GradientBoostingRegressor(random_state=101))])

SEL = list(set(XGBestCols).union(set(colsP)))
n_components = [len(SEL)] 
whiten = [True] #, False]
n_est = [3000]
learning_rate = [0.05] #, 0.01, 0.1, 0.005]
loss = ['huber'] #, 'ls', 'lad', 'quantile']
max_features = ['auto'] #, 'sqrt', 'log2']
max_depth = [3] #, 2] # , 5]
min_samples_split = [3] #, 4] 
min_samples_leaf = [3] # , 3, 2 ,4 ]
criterion = ['friedman_mse'] #, 'mse', 'mae']
alpha = [0.8] #, 0.75, 0.9, 0.7] 

param_grid =\
            dict(
                  #pca__n_components = n_components,
                  pca__whiten = whiten, 
                   model__n_estimators= n_est 
                  ,model__learning_rate = learning_rate
                  ,model__loss = loss
                  ,model__criterion = criterion
                  ,model__max_depth = max_depth
                  ,model__alpha = alpha
                  ,model__max_features = max_features
                  ,model__min_samples_split = min_samples_split
                  ,model__min_samples_leaf = min_samples_leaf
                   )

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                  , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                  ,cv=5, verbose=1, n_jobs=4)
 
GBR = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

GBR.fit(train, y_train)
res = get_results(GBR, 'GBR' , log=False)
resilduals_plots(GBR, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 79 features
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:  1.6min finished
Best Score: 23873.764958
---------------------------------------
Best Parameters:
{'model__alpha': 0.8, 'model__criterion': 'friedman_mse', 'model__learning_rate': 0.05, 'model__loss': 'huber', 'model__max_depth': 3, 'model__max_features': 'auto', 'model__min_samples_leaf': 3, 'model__min_samples_split': 3, 'model__n_estimators': 3000, 'pca__whiten': True}
Select 79 features
Out[108]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 0 15869.962 501.335 15869.962 501.335
0 R2 0 90.873 0.984 90.873 0.984
0 RMSE 0 23873.765 8664.820 23873.765 8664.820

We already have models with better performance than this one, so let's continue searching for another that can help more.

ElasticNet

In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. The elastic net method overcomes the limitations of the Lasso method which uses a penalty function based on image Use of this penalty function has several limitations. For example, in the 'large p, small n' case (high-dimensional data with few examples, what has looked like this case), the Lasso selects at most n variables before it saturates. Also if there is a group of highly correlated variables, then the Lasso tends to select one variable from a group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part to the penalty. The estimates from the elastic net method are defined by: image

  • Has a L1 penalty to generate sparsity. if we set l1_ratio to 1.0, the ElasticNet regressor would be equal to Lasso regression. Currently, l1_ratio <= 0.01 is not reliable, unless you supply your own sequence of alpha.
  • alpha: Constant that multiplies the penalty terms.
  • Has a L2 penalty to overcome some of the limitations of the Lasso, such as the number of selected variables.
  • The quadratic penalty term makes the loss function strictly convex, and it therefore has a unique minimum.

On sklearn, if you are interested in controlling the L1 and L2 penalty separately, keep in mind that this is equivalent to::

    a * L1 + b * L2    where:  alpha = a + b and l1_ratio = a / (a + b)

Its most important parameters are:

  • alpha: Constant that multiplies the penalty terms. Defaults to 1.0. See the notes for the exact mathematical meaning of this parameter.alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
  • l1_ratio: The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
  • max_iter: The maximum number of iterations
  • selection: If set to 'random', a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to 'random') often leads to significantly faster convergence especially when tol is higher than 1e-4.
  • tol: The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
In [109]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', ElasticNet(random_state=101))])

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-5, len(SEL)-3, len(SEL)] 
whiten = [False] #, True]
max_iter = [5] #, 100] 
alpha = [1e-05] #, 0.001, 0.01, 0.003, 0.00001] 
l1_ratio =  [0.00003] 
selection = ['cyclic'] #, 'random', 'cyclic']

param_grid =\
            dict(
                  model__max_iter= max_iter
                  ,pca__n_components = n_components
                  ,pca__whiten = whiten 
                  ,model__alpha = alpha
                  ,model__l1_ratio = l1_ratio
                  ,model__selection = selection
               ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=4)
 
ELA = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

ELA.fit(train, y_train)

res = get_results(ELA, 'ELA', log=False)
resilduals_plots(ELA, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:    4.1s finished
Best Score: 21395.037433
---------------------------------------
Best Parameters:
{'model__alpha': 1e-05, 'model__l1_ratio': 3e-05, 'model__max_iter': 5, 'model__selection': 'cyclic', 'pca__n_components': 109, 'pca__whiten': False}
Select 109 features
Out[109]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 2 14545.115 855.166 14550.063 858.516
0 R2 2 92.673 0.628 92.669 0.633
0 RMSE 2 21395.037 7174.726 21400.305 7196.705

This model looks promising and have good computational performance. Again we use PCA for performance improvement not for dimension reduction.

Bayesian Ridge Regression

Ridge regression is an L2 penalized model where we simply add the squared sum of the weights to our least-squares cost function: image By increasing the value of the hyper parameter λ , we increase the regularization strength and shrink the weights of our model. Please note that we don't regularize the intercept term ß0,/sub>.

Bayesian ridge regression fit a Bayesian ridge model and optimize the regularization parameters lambda (precision of the weights) and alpha (precision of the noise).

The advantages of Bayesian Regression are:

  • It adapts to the data at hand.
  • It can be used to include regularization parameters in the estimation procedure. The disadvantages of Bayesian regression include:
  • Inference of the model can be time consuming.

The main parameter on sklearn are:

  • n_iter: Maximum number of iterations. Default is 300.
  • tol: Stop the algorithm if w has converged. Default is 1.e-3.
  • alpha_1: Hyper-parameter : shape parameter for the Gamma distribution prior over the alpha parameter. Default is 1.e-6
  • alpha_2: Hyper-parameter : inverse scale parameter (rate parameter) for the Gamma distribution prior over the alpha parameter. Default is 1.e-6.
  • lambda_1: Hyper-parameter : shape parameter for the Gamma distribution prior over the lambda parameter. Default is 1.e-6.
  • lambda_2: parameter (rate parameter) for the Gamma distribution prior over the lambda parameter. Default is 1.e-6
In [110]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', BayesianRidge())]) #compute_score=False, fit_intercept=True, normalize=False

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-9] #, len(SEL)-8, len(SEL)-7] 
whiten = [True] # , False]
n_iter=  [36] # np.arange(36, 45) # [40, 35, 45, 70, 100, 200, 300, 500, 700, 1000] #  
alpha_1 = [1e-06] #0.1, 1e-04, 1e-03, 1e-02, 1e-05]
alpha_2 = [0.1] # 1e-06 , , 1e-02, 1e-04, 1e-03]
lambda_1 = [0.001] # 0.1, 1e-06, 1e-04, 1e-02, 1e-05] 
lambda_2 = [0.01] # 0.1, 1e-06, 1e-04, 1e-03, 1e-05]

param_grid =\
            dict(
                   model__n_iter = n_iter
                  ,model__alpha_1 = alpha_1
                  ,model__alpha_2 = alpha_2
                  ,model__lambda_1 = lambda_1
                  ,model__lambda_2 = lambda_2
                  ,pca__n_components = n_components
                  ,pca__whiten = whiten 
              ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=4)
 
BayR = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

BayR.fit(train, y_train)
res = get_results(BayR, 'BayR', log=False)
resilduals_plots(BayR, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    4.2s finished
Best Score: 21390.978673
---------------------------------------
Best Parameters:
{'model__alpha_1': 1e-06, 'model__alpha_2': 0.1, 'model__lambda_1': 0.001, 'model__lambda_2': 0.01, 'model__n_iter': 36, 'pca__n_components': 100, 'pca__whiten': True}
Select 109 features
Out[110]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 0 14541.231 843.233 14541.231 843.233
0 R2 0 92.667 0.668 92.667 0.668
0 RMSE 0 21390.979 6975.660 21390.979 6975.660

Other good model and without linear pattern on the residuals.

Linear Regression

In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function.

On sklearn the ordinary least squares are implemented as LinearRegression, and its main important parameters are:

  • fit_intercept: whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
  • normalize: This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
In [111]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', LinearRegression())])

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-10, len(SEL)-11, len(SEL)-9] 
whiten = [True, False]

param_grid =\
            dict(
                   pca__n_components = n_components,
                   pca__whiten = whiten
               ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=4)
 
LR = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

LR.fit(train, y_train)

res = get_results(LR, 'LR', log=False)
resilduals_plots(LR, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed:    4.6s finished
Best Score: 21396.015658
---------------------------------------
Best Parameters:
{'pca__n_components': 100, 'pca__whiten': False}
Select 109 features
Out[111]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 5 14565.345 858.002 14609.099 854.989
0 R2 4 92.663 0.691 92.634 0.674
0 RMSE 5 21396.016 7012.141 21445.358 7161.375

As you can see, this model performs very well. This is expected, since we work on eliminating the problems of collinearity, multicollinearity and maximization of significance. Of course, when we do this, we take care of the properties required by linear regressions, and try give some flexibility to the model to identify other patterns by the inclusion of some polynomials.

Orthogonal Matching Pursuit model (OMP)

OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is recomputed using an orthogonal projection on the space of the previously chosen dictionary elements.

On skelearn the main parameters are: n_nonzero_coefs: Desired number of non-zero entries in the solution. If None (by default) this value is set to 10% of n_features. tol: Maximum norm of the residual. If not None, overrides n_nonzero_coefs.

In [112]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', OrthogonalMatchingPursuit())])

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-11, len(SEL)-10, len(SEL)-9] 
whiten = [True, False]
tol = [None, 0.00005, 0.0001, 0.00000, 0.002]

param_grid =\
            dict(
                   model__tol = tol
                   ,model__n_nonzero_coefs = range(2, 6) # [10, 20, 30, 40, 50, 60, 70, 80, 90, None] # 
                   ,pca__n_components = n_components
                   ,pca__whiten = whiten
                   ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=4)
 
ORT = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

ORT.fit(train, y_train)
res = get_results(ORT, 'ORT', log=False)
resilduals_plots(ORT, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 120 candidates, totalling 600 fits
[Parallel(n_jobs=4)]: Done  50 tasks      | elapsed:    4.9s
[Parallel(n_jobs=4)]: Done 350 tasks      | elapsed:   10.9s
[Parallel(n_jobs=4)]: Done 600 out of 600 | elapsed:   16.1s finished
Best Score: 21396.015658
---------------------------------------
Best Parameters:
{'model__n_nonzero_coefs': 2, 'model__tol': 5e-05, 'pca__n_components': 100, 'pca__whiten': False}
Select 109 features
Out[112]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 11 14565.345 858.002 15835.576 807.877
0 R2 10 92.663 0.691 91.109 0.925
0 RMSE 11 21396.016 7012.141 23584.400 8725.450

As can be seen, we have achieved that this model performance equated to LR and close to ELA and Lasso.

Robust Regressor

In robust statistics, robust regression is a form of regression analysis designed to overcome some limitations of traditional parametric and non-parametric methods. Regression analysis seeks to find the relationship between one or more independent variables and a dependent variable. Certain widely used methods of regression, such as ordinary least squares, have favourable properties if their underlying assumptions are true, but can give misleading results if those assumptions are not true; thus ordinary least squares is said to be not robust to violations of its assumptions. Robust regression methods are designed to be not overly affected by violations of assumptions by the underlying data-generating process.

In particular, least squares estimates for regression models are highly sensitive to (i.e. not robust against) outliers. While there is no precise definition of an outlier, outliers are observations which do not follow the pattern of the other observations. This is not normally a problem if the outlier is simply an extreme observation drawn from the tail of a normal distribution, but if the outlier results from non-normal measurement error or some other violation of standard ordinary least squares assumptions, then it compromises the validity of the regression results if a non-robust regression technique is used.

From the sklearn the HuberRegressor implements a robust regressor strategy. The HuberRegressor is different to Ridge because it applies a linear loss to samples that are classified as outliers. A sample is classified as an inlier if the absolute error of that sample is lesser than a certain threshold. It differs from TheilSenRegressor and RANSACRegressor because it does not ignore the effect of the outliers but gives a lesser weight to them.

Its main parameters are:

  • epsilon: The parameter epsilon controls the number of samples that should be classified as outliers. The smaller the epsilon, the more robust it is to outliers.
  • max_iter: Maximum number of iterations that scipy.optimize.fmin_l_bfgs_b should run for.
  • alpha: Regularization parameter.
  • tol: The iteration will stop when max{|proj g_i | i = 1, ..., n} <= tol where pg_i is the i-th component of the projected gradient.
In [113]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', HuberRegressor())])

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-9] #, len(SEL)-8, len(SEL)-7, len(SEL)-1] 
whiten = [True] #, False]
max_iter = [2000] 
alpha = [0.0001] #, 5e-05, 0.01, 0.00005, 0.0005, 0.5, 0.001] 
epsilon = [1.005] #, 1.05, 1.01, 1.001] 
tol = [1e-01, 1e-02] #, 2e-01, 3e-01, 4e-01, 5e-01, 6e-01] 

param_grid =\
            dict(
                  model__max_iter= max_iter
                  ,pca__n_components = n_components
                  ,pca__whiten = whiten 
                  ,model__alpha = alpha
                  ,model__epsilon = epsilon
                  ,model__tol = tol
               ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=3)
 
Hub = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

Hub.fit(train, y_train)

res = get_results(Hub, 'Hub', log=False)
resilduals_plots(Hub, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed:    4.9s finished
Best Score: 21321.020939
---------------------------------------
Best Parameters:
{'model__alpha': 0.0001, 'model__epsilon': 1.005, 'model__max_iter': 2000, 'model__tol': 0.1, 'pca__n_components': 100, 'pca__whiten': True}
Select 109 features
Out[113]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 0 14089.643 614.618 14089.643 614.618
0 R2 0 92.722 0.690 92.722 0.690
0 RMSE 0 21321.021 7498.064 21321.021 7498.064

This model had a good performance, but some linear pattern of escape given the most scattered points on the top right.

Bayesian ARD Regression

In statistics, Bayesian linear regression is an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference. When the regression model has errors that have a normal distribution, and if a particular form of prior distribution is assumed, explicit results are available for the posterior probability distributions of the model's parameters.

ARDRegression is very similar to Bayesian Ridge Regression, but can lead to sparser weights. ARDRegression poses a different prior over, by dropping the assumption of the Gaussian being spherical.

Instead, the distribution over is assumed to be an axis-parallel, elliptical Gaussian distribution. Fit the weights of a regression model, using an ARD prior. The weights of the regression model are assumed to be in Gaussian distributions. Also estimate the parameters lambda (precisions of the distributions of the weights) and alpha (precision of the distribution of the noise). The estimation is done by an iterative procedures.

The main parameters are:

  • n_iter : Maximum number of iterations. Default is 300
  • tol: Stop the algorithm if w has converged. Default is 1.e-3.
  • alpha_1: Hyper-parameter : shape parameter for the Gamma distribution prior over the alpha parameter. Default is 1.e-6.
  • alpha_2: Hyper-parameter : inverse scale parameter (rate parameter) for the Gamma distribution prior over the alpha parameter. Default is 1.e-6.
  • lambda_1: Hyper-parameter : shape parameter for the Gamma distribution prior over the lambda parameter. Default is 1.e-6.
  • lambda_2: Hyper-parameter : inverse scale parameter (rate parameter) for the Gamma distribution prior over the lambda parameter. Default is 1.e-6.
In [114]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', ARDRegression())])

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-9, len(SEL)-8, len(SEL)-7, len(SEL)-1] 
whiten = [True] #, False]
n_iter=  [500] 
alpha_1 = [1e-06] #, 0.001, 1e-05, 1e-05, 1e-03]
alpha_2 = [1e-01] #, 1e-2, 1e-06]#, 1e-04, 1e-03]
lambda_1 = [1e-6]
lambda_2 = [0.1]

param_grid =\
            dict(
                  #pca__n_components = n_components,
                  pca__whiten = whiten, 
                  model__n_iter = n_iter
                  ,model__alpha_1 = alpha_1
                  ,model__alpha_2 = alpha_2
                  ,model__lambda_1 = lambda_1
                  ,model__lambda_2 = lambda_2
               ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=3)
 
ARDR = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

ARDR.fit(train, y_train)
res = get_results(ARDR, 'ARDR', log=False)
resilduals_plots(ARDR, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed: 40.4min finished
Best Score: 21690.137677
---------------------------------------
Best Parameters:
{'model__alpha_1': 1e-06, 'model__alpha_2': 0.1, 'model__lambda_1': 1e-06, 'model__lambda_2': 0.1, 'model__n_iter': 500, 'pca__whiten': True}
Select 109 features
Out[114]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 1 14612.132 675.584 14612.203 675.560
0 R2 1 92.471 0.600 92.471 0.600
0 RMSE 1 21690.138 7213.693 21690.182 7213.700

Passive Aggressive Regressor

The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate. However, contrary to the Perceptron, they include a regularization parameter C.

For regression, PassiveAggressiveRegressor can be used with loss='epsilon_insensitive' (PA-I) or loss='squared_epsilon_insensitive' (PA-II).

The main parameters are:

  • C: Maximum step size (regularization). Defaults to 1.0.
  • max_iter: The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit. Defaults to 1000 from 0.21 version, or if tol is not None.
  • tol: The stopping criterion. If it is not None, the iterations will stop when (loss > previous_loss - tol). Defaults to 1e-3 from 0.21 version.
  • epsilon: If the difference between the current prediction and the correct label is below this threshold, the model is not updated.
  • n_iter_no_change: Number of iterations with no improvement to wait before early stopping. Default=5
In [115]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', PassiveAggressiveRegressor(random_state = 101))])

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-9, len(SEL)-8, len(SEL)-7, len(SEL)-1] 
whiten = [True] #, False]
loss = ['squared_epsilon_insensitive'] #, 'epsilon_insensitive']
C = [0.001] #, 0.005, 0.003]
max_iter = [1000] 
epsilon = [0.00001] # , 0.00005
tol = [1e-03] #, 1e-05,1e-02, 1e-01, 1e-04, 1e-06]

param_grid =\
            dict(
                  pca__n_components = n_components,
                  pca__whiten = whiten, 
                  model__loss = loss
                  ,model__epsilon = epsilon
                  ,model__C = C
                  ,model__tol = tol
                  ,model__max_iter = max_iter
               ) 
 
gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=4)
 
PassR = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

PassR.fit(train, y_train)
res = get_results(PassR, 'PassR', log=False)
resilduals_plots(PassR, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:   10.4s finished
Best Score: 21424.971986
---------------------------------------
Best Parameters:
{'model__C': 0.001, 'model__epsilon': 1e-05, 'model__loss': 'squared_epsilon_insensitive', 'model__max_iter': 1000, 'model__tol': 0.001, 'pca__n_components': 100, 'pca__whiten': True}
Select 109 features
Out[115]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 0 14675.880 853.296 14732.995 835.723
0 R2 0 92.646 0.716 92.572 0.651
0 RMSE 0 21424.972 7066.795 21539.220 7050.311

SGD Regressor

Linear model fitted by minimizing a regularized empirical loss with SGD

Stochastic gradient descent (often shortened to SGD), also known as incremental gradient descent, is an iterative method for optimizing a differentiable objective function, a stochastic approximation of gradient descent optimization. The gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).

The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.

This implementation works with data represented as dense numpy arrays of floating point values for the features.

The main parameters are:

  • loss: The loss function to be used. The 'squared_loss' refers to the ordinary least squares fit. 'huber' modifies 'squared_loss' to focus less on getting outliers correct by switching from squared to linear loss past a distance of epsilon. 'epsilon_insensitive' ignores errors less than epsilon and is linear past that; this is the loss function used in SVR. 'squared_epsilon_insensitive' is the same but becomes squared loss past a tolerance of epsilon.
  • penalty: The penalty (aka regularization term) to be used. Defaults to 'l2' which is the standard regularizer for linear SVM models. 'l1' and 'elasticnet' might bring sparsity to the model (feature selection) not achievable with 'l2'.
  • alpha: Constant that multiplies the regularization term. Defaults to 0.0001 Also used to compute learning_rate when set to 'optimal'.
  • l1_ratio: The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.
  • max_iter: The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit. Defaults to 1000 from 0.21 version, or if tol is not None.
  • tol: The stopping criterion. If it is not None, the iterations will stop when (loss > previous_loss - tol). Defaults to 1e-3 from 0.21 version.
  • epsilon: Epsilon in the epsilon-insensitive loss functions; only if loss is 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'. For 'huber', determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold.
  • eta0: The initial learning rate for the 'constant', 'invscaling' or 'adaptive' schedules. The default value is 0.0 as eta0 is not used by the default schedule 'optimal'.
  • power_t:The exponent for inverse scaling learning rate. Default 0.5.
  • learning_rate: The learning rate schedule:
    • 'constant': eta = eta0
    • 'optimal': eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.
    • 'invscaling': eta = eta0 / pow(t, power_t)
    • 'adaptive': eta = eta0, as long as the training keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
In [116]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', SGDRegressor(random_state = 101))])

SEL = list(set(RFEcv).union(set(colsP)))
n_components = [len(SEL)-9, len(SEL)-8, len(SEL)-7, len(SEL)-1] 
whiten = [True] #, False]
loss = ['squared_loss'] #, 'huber', 'squared_epsilon_insensitive', 'epsilon_insensitive']
penalty = ['l2'] #, 'elasticnet', 'l1']
l1_ratio = [0.7] #, 0.8] #[0.2, 0.5, 0.03]
learning_rate = ['invscaling'] #, 'constant', 'optimal']
alpha = [0.001] # [1e-01, 1e-2, 1e-03, 1e-4, 1e-05]
epsilon =  [1e-01] #, 1e-2, 1e-03, 1e-4, 1e-05]
tol = [0.001] #, 0.003] 
eta0 = [0.01] #, 1e-1, 1e-03, 1e-4, 1e-05] 
power_t = [0.5]
 
param_grid =\
            dict(
                   pca__n_components = n_components
                   ,pca__whiten = whiten, 
                   model__penalty = penalty
                   ,model__l1_ratio = l1_ratio
                   ,model__loss = loss
                   ,model__alpha = alpha
                   ,model__epsilon = epsilon
                   ,model__tol = tol
                   ,model__eta0 = eta0
                   ,model__power_t = power_t
                   ,model__learning_rate = learning_rate
               ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=4)
 
SGDR = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

SGDR.fit(train, y_train)
res = get_results(SGDR, 'SGDR', log=False)
resilduals_plots(SGDR, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:    3.8s finished
Best Score: 21393.907877
---------------------------------------
Best Parameters:
{'model__alpha': 0.001, 'model__epsilon': 0.1, 'model__eta0': 0.01, 'model__l1_ratio': 0.7, 'model__learning_rate': 'invscaling', 'model__loss': 'squared_loss', 'model__penalty': 'l2', 'model__power_t': 0.5, 'model__tol': 0.001, 'pca__n_components': 100, 'pca__whiten': True}
Select 109 features
Out[116]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 0 14560.532 853.832 14619.577 871.234
0 R2 0 92.665 0.685 92.580 0.597
0 RMSE 0 21393.908 7005.896 21526.560 6971.819

Light GBM Regressor

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

  • Faster training speed and higher efficiency
  • Lower memory usage
  • Better accuracy
  • Parallel and GPU learning supported
  • Capable of handling large-scale data

Main Parameters:

  • boosting_type: 'gbdt', traditional Gradient Boosting Decision Tree. 'dart', Dropouts meet Multiple Additive Regression Trees. 'goss', Gradient-based One-Side Sampling. 'rf', Random Forest.
  • num_leaves: Maximum tree leaves for base learners (default=31).
  • max_depth: Maximum tree depth for base learners, -1 means no limit (default=-1).
  • objective: Specify the learning task and the corresponding learning objective or a custom objective function to be used. Default: 'regression' for LGBMRegressor, 'binary' or 'multiclass' for LGBMClassifier, 'lambdarank' for LGBMRanker.
  • min_split_gain: Minimum loss reduction required to make a further partition on a leaf node of the tree (default=0.).
  • min_child_samples: Minimum number of data needed in a child (leaf) (default=20).
  • reg_alpha: L1 regularization term on weights.
  • reg_lambda: L2 regularization term on weights.
  • importance_type: The type of feature importance to be filled into featureimportances. If 'split', result contains numbers of times the feature is used in a model. If 'gain', result contains total gains of splits which use the feature.
In [117]:
model = Pipeline([
        ('pca', PCA(random_state = 101)),
        ('model', lgb.LGBMRegressor(objective='regression',  n_jobs = 4, feature_fraction_seed=9, bagging_seed=9, 
                                    random_state = 101))]) #  max_bin = 55, 

SEL = list(set(XGBestCols).union(set(colsP)))
n_components = [len(SEL)-9, len(SEL)-8, len(SEL)-7, len(SEL)-1] 
whiten = [True] #, False]
boosting_type = ['gbdt'] #, 'dart', 'rf']
max_depth = [3, 2, 4]
learning_rate= [0.05] #, 0.1, 0.01, 0.003]
n_estimators=[2000]
min_data_in_leaf =[5] 
num_leaves=[3] 
feature_fraction = [0.3]
bagging_fraction = [0.7]
bagging_freq = [5]
min_sum_hessian_in_leaf = [11]
reg_alpha = [1e-2]#, 1e-1, 1e-3, 1e-4, 1e-5, 1e-6]
reg_lambda = [1e-1]#, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
param_grid =\
            dict(
                   pca__n_components = n_components,
                   pca__whiten = whiten, 
                   model__boosting_type = boosting_type,
                   model__max_depth = max_depth,
                   model__learning_rate = learning_rate,
                   model__n_estimators = n_estimators,
                   model__min_data_in_leaf = min_data_in_leaf, 
                   model__num_leaves = num_leaves,
                   model__feature_fraction = feature_fraction,
                   model__bagging_fraction = bagging_fraction,
                   model__bagging_freq = bagging_freq, 
                   model__min_sum_hessian_in_leaf = min_sum_hessian_in_leaf,
                   model__reg_alpha = reg_alpha,
                   model__reg_lambda = reg_lambda
               ) 

gs = GridSearchCV(estimator = model, param_grid = param_grid, refit = 'neg_mean_squared_error'
                   , scoring=list(['neg_mean_squared_error' , 'neg_mean_absolute_error', 'r2']) 
                   ,cv=5, verbose=1, n_jobs=4)
 
LGBM = Pipeline([
        ('sel', select_fetaures(select_cols=SEL)),
        ('scl', RobustScaler()),
        ('gs', gs)
 ])

LGBM.fit(train, y_train)
res = get_results(LGBM, 'LGBM', log=False)
resilduals_plots(LGBM, train, y_train, log=False)
results = pd.concat([results, res], axis=0)
res.loc[:, 'Scorer' : 'MeanScoreStd']
Recive 187 features...
Select 79 features
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   35.9s
[Parallel(n_jobs=4)]: Done  60 out of  60 | elapsed:   51.8s finished
Best Score: 23719.500045
---------------------------------------
Best Parameters:
{'model__bagging_fraction': 0.7, 'model__bagging_freq': 5, 'model__boosting_type': 'gbdt', 'model__feature_fraction': 0.3, 'model__learning_rate': 0.05, 'model__max_depth': 3, 'model__min_data_in_leaf': 5, 'model__min_sum_hessian_in_leaf': 11, 'model__n_estimators': 2000, 'model__num_leaves': 3, 'model__reg_alpha': 0.01, 'model__reg_lambda': 0.1, 'pca__n_components': 71, 'pca__whiten': True}
Select 79 features
Out[117]:
Scorer Index BestScore BestScoreStd MeanScore MeanScoreStd
0 MEA 2 15976.072 686.666 16163.546 566.702
0 R2 1 90.992 0.543 90.807 0.699
0 RMSE 1 23719.500 7199.583 23965.352 7872.495

Check the best results from the models hyper parametrization

In [118]:
results.loc[results.Scorer=='RMSE', ['Name','BestScore', 'BestScoreStd']].sort_values(by='BestScore', ascending=True)
Out[118]:
Name BestScore BestScoreStd
0 XGBRegressor 19959.960 6373.905
0 Hub 21321.021 7498.064
0 BayR 21390.979 6975.660
0 SGDR 21393.908 7005.896
0 ELA 21395.037 7174.726
0 LR 21396.016 7012.141
0 ORT 21396.016 7012.141
0 PassR 21424.972 7066.795
0 lasso 21504.641 7071.982
0 ARDR 21690.138 7213.693
0 LGBM 23719.500 7199.583
0 GBR 23873.765 8664.820

Stacking the Models

In [123]:
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([model.predict(X) for model in self.models_])
        return np.mean(predictions, axis=1)   

#defining RMSLE evaluation function
def RMSLE (y, y_pred):
    return (np.sqrt(mean_squared_error(y, y_pred)))

# Averaged base models score
averaged_models = AveragingModels(models = (XGBR, BayR, PassR)) # Hub, ELA,  lasso, ARDR, LGBM, GBR

averaged_models.fit(train, y_train) 
stacked_train_pred = averaged_models.predict(train)

stacked_pred = (averaged_models.predict(test))
rmsle = RMSLE(y_train,stacked_train_pred)

print('RMSLE score on the train data: {:.4f}'.format(rmsle))

print('Accuracy score: {:.6%}'.format(averaged_models.score(train, y_train)))
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:   18.3s finished
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    2.7s finished
Recive 187 features...
Select 109 features
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:    2.8s finished
Select 109 features
Select 109 features
Select 109 features
Select 109 features
Select 109 features
Select 109 features
RMSLE score on the train data: 18933.2950
Select 109 features
Select 109 features
Select 109 features
Accuracy score: 94.314834%

Create Submission File:

In [124]:
# Preper Submission File
ensemble = stacked_pred *1
submit = pd.DataFrame()
submit['id'] = test_ID
submit['SalePrice'] = ensemble
# ----------------------------- Create File to Submit --------------------------------
submit.to_csv('SalePrice_N_submission.csv', index = False)
submit.head()
Out[124]:
id SalePrice
0 1461 121390.504
1 1462 161597.055
2 1463 181572.678
3 1464 198661.911
4 1465 198232.786

Conclusion

As we can see, through a method we were able to generate models that could present good generalization and better performance when combined.

The big challenge was the proper detection and decision-cutting of outliers, the reassessment of noise-generating features, and the combined combination of selection and data engineering strategies.

We can put into practice a great number of techniques and methods, from EDA to the generation of stacked models, covering a broad conceptual and practical expectation as desired.

If you run this kernel as it is and submit the file, you will be able to be among the top 34.7% with 0.12559 points, but as you noted, we had to make decisions that impacted the performance better or worse depending on the characteristics of each model . In this sense, I invite you to download this notebook and practice it, send your suggestions and comments, knowing that it is possible to be among the top 16%.

For next steps, I suggest:

  • Applies it to a deep learning model like TensorFlow
  • Try a Random Forest Regressor, with and without the transformations like box cox and without polynomials.
  • Try eliminate outliers only through the residuals
  • Try find a good model with Sales price on log1p transformation
  • Run some models without polynomials and try other combinations of polynomials, with more and less features
  • Apply BboxCox on others features, and try boxcox1p with different lambdas
  • Some categorical data has better performance if you ignore that it is categorical. Since this is a wrong approach, you need find than empirical and try give some meaning, if you really can.

Thank you for your attention so far, I also ask you to share and accept any comments and feedback.

My best regards and good luck.